

CHAPTER 
ONE 


PREFACE 


In recent years, with the development of technology, Optical Character Recognition (OCR) has been widely used in
various scenarios. Text detection and recognition algorithms based on deep learning are now common in daily life,
for example in license plate recognition, bank card recognition, ID card recognition, and train ticket recognition.
General-purpose OCR is also widely used, for tasks such as content security monitoring, or combined with visual
features for tasks like video comprehension and video search.


On the one hand, OCR has a wide range of applications. On the other hand, few books introduce OCR comprehensively
from theory to practice, so many algorithm engineers have to take detours before they become familiar with the
field. At the same time, practical applications, especially general-purpose scenarios, still face challenges.
Technical difficulties include affine transformation, scale variation, insufficient lighting, and blurred shots.
In addition, OCR applications often handle massive amounts of data that must be processed in real time. They are
also often deployed on mobile or embedded hardware, where storage space and computing power are limited, so the
size and prediction speed of OCR models matter a great deal. Sharing this knowledge, which is crucial for
practitioners, can accelerate the progress of OCR-related industries and the industrial adoption of OCR deep
learning technology.


With this motivation, and focusing on the key topics of industrial OCR applications, this book is named “Dive into
OCR” in tribute to the world-renowned book “Dive into Deep Learning”. It was created jointly by universities,
enterprises, and community developers. All content and code are open-sourced on GitHub, and a supporting video
course is provided for developers.


Because the book was written by many developers, the editors have tried to unify the style as much as possible, but
some inconsistencies are inevitable. If you find omissions or mistakes, you are welcome to point them out in the
GitHub discussions, or to submit a Pull Request directly and take part in building the book.


1.1 About the Book 


This book is written and maintained through GitHub, using code plus Jupyter notebooks that integrate code, formulas,
and images. It covers the latest methods and applications of OCR and is constantly updated. In terms of content, it
mainly introduces deep-learning-based techniques and considers their feasibility in practical applications. We dare
not claim that this is a rigorous textbook, but it is a vivid tutorial with executable code that can help developers
quickly implement OCR projects.


We strongly believe in the importance of hands-on learning for deep learning. Wherever possible, we show how to
implement a given method in code and explain the ideas and implementation details behind the algorithm design. This
book is intended both for OCR beginners who want to quickly understand the basic concepts and key algorithms of the
field, and for algorithm engineers who want to get their projects off the ground quickly using the example code and
the inference and deployment programs.


For this project, we hope to achieve the following goals:


• Free online access to the source files and code for everyone.


• Use this book as a bridge to attract OCR researchers to build and share, publish their recent research results,
  and expand the technical breadth as much as possible, so that the book becomes a research overview textbook for
  OCR technology.


• All algorithms contain executable code, showing OCR algorithm engineers how to solve problems in practice, so that
  the book becomes a tutorial document for industrial OCR implementation.


• The book content is co-built and shared by the entire community and is constantly updated, keeping pace with the
  still rapidly developing field of deep learning.


• Questions and answers about technical details can be discussed in PaddleOCR’s GitHub issues and discussions,
  allowing people to answer each other’s questions and exchange experiences.


1.1.1 Content and Structure 


1. Preface
2. Course Prerequisites


The first part is the preface and preliminary knowledge of the book, including the knowledge index and resource 
links needed in the process of using the book. 


The second part of the book, chapters 3-7, introduces the concepts, applications, and industrial practices related
to the core detection and recognition capabilities of OCR. “Introduction to OCR Technology” gives an overview of
the application scenarios and challenges of OCR, the basic concepts of the technology, and the pain points in
industrial applications. The two basic OCR tasks are then introduced in the chapters “Text Detection” and “Text
Recognition”, each of which goes from algorithm explanation to code details and practical exercises. Chapters 6 and
7 introduce the PP-OCR series of models in detail. PP-OCR is an OCR system for industrial applications. It applies
a series of optimization strategies to the detection and recognition models to reach general-purpose, industrial
SOTA performance, and it supports a variety of inference and deployment solutions so that enterprises can quickly
implement OCR applications.


The third part of the book, chapters 8-11, introduces applications beyond the two-stage OCR engine, including
data synthesis, pre-processing algorithms, and end-to-end models. It focuses on OCR capabilities in document
scenarios, including layout analysis, table recognition, and visual document Q&A, again combining algorithms and
code so that readers can understand and apply them in depth.


1.1.2 Intended Audience 


This book is for students, researchers and engineers who wish to learn and apply OCR knowledge in depth. It is a practical 
book in the OCR field and requires basic knowledge of deep learning, machine learning, and computer vision. 


1.2 Community 


There are two main ways to discuss this book in the PaddleOCR community:


1. [i18n] PaddleOCR’s GitHub discussions are mainly for technical discussions and exchanges among international
developers, including but not limited to theoretical algorithms, technologies, and applications.


2. [Chinese] The PaddleOCR Community regular season is a Chinese community activity centered on OCR. It offers
multi-level, multi-dimensional open tasks to different types of developers and gives material and non-material
rewards to excellent community projects.


1.3 Acknowledgements


We are extremely grateful to the contributors to the Chinese and English editions of this book, whose work includes
but is not limited to adding content, correcting errata, improving the structure, and providing valuable feedback.
In particular, we would like to thank every developer who has contributed to the project. The GitHub usernames or
names of these contributors are (in no particular order): LDOUBLEV, WenmuZhou, dyning, tink2123, MissPenguin,
littletomatodonkey, Evezerest, andyjpaddle, D-DanielYang, Topdu, weisy11, BeyondYourself, JetHong, Intsigstephon,
xmy0916, cuicheng01, bjjwwang, ZhangXinNan, hysunflower, d2623587501, Wei-JL, xxxpsyduck, Yipeng-Sun, TingquanGao,
tangmq, MrCuiHao, authorfu, HexToString, GreatV, neonhuang, xiangyubo, Huntersdeng, iamyoyo, buptlihang,
Lovely-Pig, OliverLPH, YukSing12, bingooo, fengxiaoshuai, lilinxiong, SibiAkkash, linkecoding, kjf4096, Sunny-wong,
bupt906, XiaoguangHu01, Nikhil-Sawant-141, xxlyu-2046, znsoftm, xiaoyangyang2, sdcb, lyl120117, daassh, PeterHO323,
before31, zhiqiu, zhangyingying520, DannyIsFunny, ufoym, ITerydh, fushall, baiyfbupt, OneYearIsEnough, tirkarthi,
Zhouzd21, karlhorky, lgcy, raoyutian, ronny1996, light1003, JimEverest, Justus-Jonas, Jane-Ding, sushant1212,
mengful88, Channingss, edencfc, mymagicpower


In addition, we would like to give special thanks to the following developers in the OCR community: RangeKing,
HustBestCat, v3fc, 1084667371, livingbody, haigang1975, fansong1983, Kongsea, fanruinet, thunderstudying,
WZMIAOMIAO. They have made an outstanding contribution to the Chinese and English documents of the e-book.


Besides the e-book, “Dive into OCR” also has a supporting Chinese video course, which is popular with OCR developers
and has attracted more than 8000 people to sign up. English courses will be launched later.
CHAPTER
TWO 


COURSE PREREQUISITES 


The OCR models covered in this course are based on deep learning, so this section introduces the related basic
knowledge, environment configuration, project engineering, and other materials. Readers who are new to deep learning
can make use of this part.


2.1 Preliminary Knowledge 


The “learning” in deep learning derives from the neurons, perceptrons, and multilayer neural networks of machine
learning, so understanding basic machine learning algorithms greatly helps in understanding and applying deep
learning. The “deepness” of deep learning is embodied in the vector-based mathematical operations, such as
convolution and pooling, used to process large amounts of information. If you are not familiar with the theoretical
foundations of the two, you can learn them from Li Hongyi’s Linear Algebra and Machine Learning courses.


To understand deep learning itself, you can refer to the basic course of Bi Ran, an outstanding architect of Baidu: Baidu 
Architect Guides You through Deep Learning Practice, which includes the development history of deep learning and 
introduces all its components with one classic case. It is a practice-oriented course. 


To put theoretical knowledge into practice, the Basic Knowledge of Python course is essential. To reproduce deep
learning models quickly, this course uses the PaddlePaddle framework. If you have used other frameworks before, the
Quick Start Document will help you learn it quickly.


2.2 Basic Environment Preparation 


If you want to run the code in this course in a local environment and have not built a Python environment before,
you can follow the Beginner’s Guide to Prepare the Operational Environment and install an Anaconda or Docker
environment based on your operating system.


If you don’t have local resources, you can run the code at the AI Studio training platform. Each item in it is presented 
in a notebook, convenient for developers to learn. If you don’t know how to use Notebook, refer to AI Studio Project 
Description. 
2.3 Code Acquisition and Running


This course is built on the PaddleOCR code repository. First, clone the whole PaddleOCR project:


# [recommended]
git clone https://github.com/PaddlePaddle/PaddleOCR


# If you cannot pull due to network problems, you can also use the repository hosted on Gitee:
git clone https://gitee.com/paddlepaddle/PaddleOCR


Note: Gitee’s code hosting service may not synchronize updates of this GitHub project in real time; there is a
delay of 3~5 days. Please use the recommended method first.


If you are unfamiliar with git, you can download the compressed package from the Code menu on the PaddleOCR
homepage.


Then install third-party libraries: 


cd PaddleOCR 
pip3 install -r requirements.txt 
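

After installation, a quick check that the expected packages can be imported may save debugging time later. The
sketch below is only an illustration: the module names in `wanted` are assumptions about typical dependencies, not
a list taken from requirements.txt, and `missing_modules` is a hypothetical helper.

```python
import importlib.util


def missing_modules(names):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Assumed module names; adjust them to match requirements.txt.
    wanted = ["paddle", "cv2", "numpy"]
    missing = missing_modules(wanted)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All listed modules are importable.")
```

If anything is reported as missing, rerunning the pip3 command above inside the intended Python environment is
usually the first thing to try.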


2.4 Access to Information 


The PaddleOCR Usage Document elaborates on how to use PaddleOCR for model application, training, and deployment.
The document is informative, and most user questions are covered in it or in the FAQ. The FAQ in particular gathers
common issues organized by the application process of deep learning. It is recommended that you read it carefully.


2.5 Asking for Help 


If you encounter a bug, a usability problem, or a documentation issue while using PaddleOCR, you can contact the
official team through GitHub issues. Please follow the issue template and provide as much information as possible
so that the team can locate the problem quickly. Our WeChat group is also a day-to-day communication channel for
PaddleOCR users, especially for consultation. Besides the PaddleOCR team members, some enthusiastic developers there
also answer questions.
CHAPTER
THREE 


INTRODUCTION TO OCR TECHNOLOGY 




Note: The above pictures are from the Internet 


3.1 Technical Background of OCR 


3.1.1 Application Scenarios 


• What is OCR?


OCR (Optical Character Recognition) is a key field in computer vision. Traditional OCR usually targets scanned
documents. Today the term often refers to scene text recognition (STR), which targets text in natural scenes, such
as the signs shown in the figure below.


Figure 1: Document text recognition VS. Scene text recognition


• What are the application scenarios of OCR?


OCR technology has abundant application scenarios. One typical scenario is the recognition of structured text in
specific verticals, which is widely used in daily life: license plate recognition, bank card recognition, ID card
recognition, train ticket recognition, and so on. The common feature of these verticals is that the formats are
fixed, which makes them well suited to automation with OCR, saving labor and improving efficiency.


This kind of recognition is currently the most widely used and technically most mature application of OCR.


[Figure panels: General scene, Traffic scene, Card scene, Industrial scene, Medical scene, Educational scene]


Figure 2: Application scenarios of OCR technology 


In addition to recognizing structured text in specific verticals, general-purpose OCR has many application
scenarios and is often combined with other technologies to complete multi-modal tasks. In video scenes, for
example, OCR is often used for automatic subtitle translation and content security monitoring, or combined with
visual features for tasks like video understanding and video search.


Figure 3: General OCR in a multi-modal scene


3.1.2 Technical Challenges 


There are two kinds of technical challenges: algorithmic challenges and application challenges.


• Algorithmic challenges


The rich application scenarios of OCR bring many technical difficulties. Here are eight common problems:


[Figure panels: Perspective transformation, Large scale change, Curve text, Background noise, Different fonts, Blur,
Illumination]


Figure 4: Technical challenges of OCR algorithms 


These problems pose huge technical challenges for text detection and recognition. They arise mainly in natural
scenes, which is where most academic research, and most academic OCR datasets, currently focus. Many studies
address these issues, but recognition remains more challenging than detection.


• Application challenges


In real applications, especially general-purpose scenarios, OCR faces two major difficulties in addition to the
algorithmic ones summarized above, such as affine transformation, scale problems, insufficient lighting, and
blurred shots:


1. Massive data requires OCR to run in real time. OCR is often applied to massive amounts of data, so real-time
processing is demanded, but making the model fast enough to meet this requirement is quite challenging.


2. End-side applications require the OCR model to be light enough and fast enough. OCR is often deployed on
mobile terminals or embedded hardware. There are generally two modes for end-side OCR applications: uploading
images to a server, or recognizing them directly on the terminal. The former places higher demands on the network
and performs poorly in real time, puts the server under high pressure when request volumes are large, and may
introduce security risks in data transmission, so the latter is preferred. However, the storage space and
computing power of the terminal are limited, so there are high requirements for the size and inference speed of
the OCR model.




Figure 5: Technical challenges of OCR application 
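

To make the inference-speed requirement concrete, it helps to measure latency in a repeatable way. The following is
a minimal sketch under stated assumptions: `measure_latency` and `dummy_predict` are hypothetical names, and
`dummy_predict` merely stands in for a real OCR model call. Warm-up runs are discarded because the first calls to a
model are often slower than steady-state calls.

```python
import statistics
import time


def measure_latency(predict, sample, warmup=3, runs=20):
    """Time predict(sample) and return (mean_ms, approximate_p95_ms)."""
    # Warm-up calls are not timed.
    for _ in range(warmup):
        predict(sample)
    timings_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        predict(sample)
        timings_ms.append((time.perf_counter() - start) * 1000.0)
    timings_ms.sort()
    # Approximate 95th percentile taken from the sorted timings.
    p95_index = min(len(timings_ms) - 1, int(0.95 * len(timings_ms)))
    return statistics.mean(timings_ms), timings_ms[p95_index]


def dummy_predict(image):
    # Hypothetical stand-in for a real OCR model call.
    return sum(image)


if __name__ == "__main__":
    mean_ms, p95_ms = measure_latency(dummy_predict, list(range(10000)))
    print(f"mean: {mean_ms:.3f} ms, p95: {p95_ms:.3f} ms")
```

Reporting a tail value alongside the mean is a design choice: in real-time services, occasional slow requests often
matter more than the average.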


3.2 OCR Cutting-edge Algorithms 


Although OCR is a relatively specific task, it involves many technologies, including text detection, text
recognition, end-to-end text recognition, document analysis, and so on. Academic research on OCR-related
technologies is flourishing. The following sections briefly introduce several key technologies in the OCR task.


3.2.1 Text Detection 


The text detection task is to locate the text regions in an input image. In recent years there has been much
academic research on text detection. One class of methods treats text detection as a special case of object
detection and adapts general object detectors to text. For example, TextBoxes [1] builds on the one-stage detector
SSD [2] and adjusts the target boxes to fit text lines with extreme aspect ratios, while CTPN [3] is developed from
Faster RCNN [4]. However, text detection still differs from object detection in both the targets and the task
itself: text lines are often very long and look like “stripes”, the spacing between lines is small, text may be
curved, and so on. Many algorithms designed specifically for text detection have therefore been derived, such as
EAST [5], PSENet [6], and DBNet [7].


Figure 6: Example of text detection task


At present, popular text detection algorithms can be roughly divided into two categories: Regression-based
Algorithms and Segmentation-based Algorithms. Some algorithms combine the two. Regression-based algorithms draw on
general object detection algorithms: they regress detection boxes from preset anchors, or regress directly at the
pixel level. Such methods perform well on regularly shaped text but poorly on irregularly shaped text. For example,
CTPN [3] is good at detecting horizontal text but poor at detecting twisted and curved text, and SegLink [8] is
better suited to long text but less suited to sparsely distributed text. Segmentation-based algorithms draw on
Mask-RCNN [9]. They perform better across varied scenes and text shapes, but their post-processing is complicated,
so they may be slow, and they cannot detect overlapping text.


Figure 7: Overview of text detection algorithms


[Figure panels: image, segmentation map, threshold map, binarization map, detection results]


Figure 8: (left) Anchor optimization of CTPN[3] algorithm based on regression (middle) Optimized post-processing of DB[7] algorithm t 
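

The post-processing step of a segmentation-based detector can be illustrated with a deliberately simple sketch:
binarize a probability map with a fixed threshold, then take the bounding box of each connected region. This is a
generic illustration, not the post-processing of DB or any other specific algorithm; the function name, the fixed
threshold of 0.5, and the use of axis-aligned boxes are all assumptions made for brevity.

```python
from collections import deque


def boxes_from_probability_map(prob_map, threshold=0.5):
    """Binarize a 2D probability map and return one bounding box per
    4-connected foreground region as (x_min, y_min, x_max, y_max)."""
    height = len(prob_map)
    width = len(prob_map[0]) if height else 0
    visited = [[False] * width for _ in range(height)]
    boxes = []
    for y in range(height):
        for x in range(width):
            if visited[y][x] or prob_map[y][x] < threshold:
                continue
            # Breadth-first search over one connected text region.
            queue = deque([(y, x)])
            visited[y][x] = True
            x_min = x_max = x
            y_min = y_max = y
            while queue:
                cy, cx = queue.popleft()
                x_min, x_max = min(x_min, cx), max(x_max, cx)
                y_min, y_max = min(y_min, cy), max(y_max, cy)
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    if (0 <= ny < height and 0 <= nx < width
                            and not visited[ny][nx]
                            and prob_map[ny][nx] >= threshold):
                        visited[ny][nx] = True
                        queue.append((ny, nx))
            boxes.append((x_min, y_min, x_max, y_max))
    return boxes


if __name__ == "__main__":
    toy_map = [
        [0.9, 0.8, 0.1, 0.0],
        [0.1, 0.2, 0.1, 0.7],
        [0.0, 0.1, 0.0, 0.9],
    ]
    print(boxes_from_probability_map(toy_map))  # [(0, 0, 1, 0), (3, 1, 3, 2)]
```

The sketch also hints at why overlapping text is hard for this family of methods: two text instances whose pixels
touch fall into the same connected region and come out as a single box.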


The technologies related to text detection will be interpreted and practiced in Chapter 2. 


3.2.2 Text Recognition 


Text recognition recognizes the text content in an image. Its input is generally the text region cropped from the
image using the text box produced by text detection. According to the shape of the text to be recognized, text
recognition can generally be divided into two categories: Regular Text Recognition and Irregular Text Recognition.
Regular text mainly refers to printed fonts, scanned text, and so on, which is roughly horizontal. Irregular text is
often not horizontal and is frequently curved, occluded, or blurred. Irregular text scenes are very challenging and
are the main research direction in text recognition.


Figure 9: (Left) Regular texts VS. (Right) Irregular texts


Algorithms for regular text recognition can be divided into two types according to their decoding method:
CTC-based algorithms and Sequence2Sequence-based algorithms. They differ in how they convert the sequence features
learned by the network into the final recognition result. A representative CTC-based algorithm is the classic CRNN.


[Figure: comparison of the two decoding approaches.
CRNN+CTC (CRNN, STAR-Net, Rosetta): introduces a space character; labels do not require character-level alignment.
Pros: high efficiency, good for regular and long text. Cons: does not use context information, and bad for
irregular text.
Seq2seq+attention (R2AM, SAR, RARE): Pros: higher accuracy. Cons: poor effect for too long or too short text.]


Figure 10: CTC-based recognition algorithm VS. Attention-based recognition algorithm 
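

The way CTC turns per-frame predictions into a string can be sketched with greedy decoding: take the best class in
each frame, merge consecutive repeats, then drop the extra blank symbol. This is a minimal sketch, not the decoding
code of CRNN or PaddleOCR; the function name, the toy alphabet, and the choice of index 0 for the blank are
assumptions.

```python
def ctc_greedy_decode(frame_scores, alphabet, blank_index=0):
    """Greedy CTC decoding.

    frame_scores: one list of class scores per frame.
    alphabet: maps class index to character; the entry at blank_index is never emitted.
    """
    # Best class per frame.
    best_path = [max(range(len(scores)), key=scores.__getitem__) for scores in frame_scores]
    decoded = []
    previous = None
    for index in best_path:
        # Emit only when the class changes and is not the blank.
        if index != previous and index != blank_index:
            decoded.append(alphabet[index])
        previous = index
    return "".join(decoded)


if __name__ == "__main__":
    alphabet = ["-", "a", "b"]  # index 0 is the assumed blank
    scores = [
        [0.1, 0.8, 0.1],    # a
        [0.2, 0.7, 0.1],    # a (repeat, merged)
        [0.9, 0.05, 0.05],  # blank (separates the two a's)
        [0.1, 0.6, 0.3],    # a
        [0.1, 0.2, 0.7],    # b
        [0.2, 0.1, 0.7],    # b (repeat, merged)
    ]
    print(ctc_greedy_decode(scores, alphabet))  # aab
```

The blank between the two runs of "a" is what lets the decoder output a doubled letter, and it is also why labels
do not need to be aligned to individual frames.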


The recognition algorithms for irregular text are more varied. Methods like STAR-Net [12] add a rectification
module such as TPS that corrects irregular text into a regular rectangle before recognition. Attention-based methods
like RARE [13] pay more attention to the correlation between parts of the sequence. Segmentation-based methods
treat each character of a text line as a separate unit, on the grounds that recognizing segmented characters is
easier than recognizing the entire text line after rectification. In addition, with the rapid development of the
Transformer [14] and its proven effectiveness across tasks in recent years, a number of transformer-based text
recognition algorithms have emerged. These use the transformer structure to address the limitations of CNNs in
modeling long-range dependencies and have achieved good results.


Figure 11: Recognition algorithm based on character segmentation [15]


The technologies related to text recognition will be explained and practiced in detail in Chapter 3.


3.2.3 Document Structured Recognition


Traditional OCR technology meets the demand for text detection and recognition. In practical scenarios, however,
what is usually needed is structured information, such as the fields extracted from ID cards and invoices or the
structure of a table. Typical application scenarios include express document extraction, contract content
comparison, financial factoring document comparison, and logistics document identification. Combining OCR results
with post-processing is a commonly used structuring scheme, but it is complicated: the post-processing has to be
designed carefully and generalizes poorly. As OCR technology continues to mature and the demand for structured
information extraction grows, technologies for intelligent document analysis, such as layout analysis, table
recognition, and key information extraction, have gained increasing attention.


• Layout Analysis


Layout analysis classifies the content of document images into categories such as plain text, titles, tables, and
pictures. Current methods generally treat this as a detection or segmentation problem. For example, Soto Carlos [16]
uses contextual information and the inherent position of the document content to improve region detection
performance on top of the object detection algorithm Faster R-CNN. Sarkar Mausoom et al. [17] propose a prior-based
segmentation mechanism to train the document segmentation model on high-resolution images, solving the problem that
different structures in dense regions become merged and indistinguishable when the original image is reduced too
much.


Figure 12: Layout analysis


• Table Recognition


Table recognition identifies the table information in a document and converts it into an Excel file. Tables in text
images come in diverse types and styles, such as different rowspans and colspans and different text types. In
addition, the style of the document and the lighting conditions during shooting pose great challenges, making table
recognition a difficult research problem in document understanding.


Figure 13: Table recognition


There are many table recognition methods. Early approaches were traditional algorithms based on heuristic rules,
such as the T-Rect algorithm proposed by Kieninger et al. [18], which generally rely on hand-designed rules and
connected-component detection and analysis. In recent years, as deep learning has continued to develop, some
CNN-based table structure recognition algorithms have emerged, such as DeepTabStR proposed by Siddiqui Shoaib Ahmed
et al. [19] and TabStruct-Net proposed by Raja Sachin et al. [20]. In addition, with the rise of graph neural
networks, some researchers have applied them to table structure recognition and treat table recognition as a graph
reconstruction problem. TGRNet, proposed by Xue Wenyuan et al. [21], works this way. There are also end-to-end
solutions in which the network directly outputs the table structure as HTML. Most of these use Seq2Seq models,
based on attention or the transformer, to predict the table structure. TableMaster [22] is one example.


Figure 14: Table recognition methods
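

To illustrate what "outputting the table structure as HTML" can mean in practice, the sketch below merges a
predicted sequence of structure tokens with separately recognized cell texts. It is a hypothetical illustration
only: the token vocabulary, the function name `build_table_html`, and the assumption that cell texts arrive in
reading order are not taken from TableMaster or any other specific model.

```python
from html import escape


def build_table_html(structure_tokens, cell_texts):
    """Fill recognized cell texts into a predicted HTML structure sequence."""
    cells = iter(cell_texts)
    parts = ["<table>"]
    for token in structure_tokens:
        parts.append(token)
        # An opening cell tag (with or without span attributes) consumes the next text.
        if token.startswith("<td"):
            parts.append(escape(next(cells, "")))
    parts.append("</table>")
    return "".join(parts)


if __name__ == "__main__":
    tokens = [
        "<tr>", '<td colspan="2">', "</td>", "</tr>",
        "<tr>", "<td>", "</td>", "<td>", "</td>", "</tr>",
    ]
    texts = ["Item & price", "Tea", "3"]
    print(build_table_html(tokens, texts))
```

Keeping structure and cell content as two separate sequences mirrors the split between structure prediction and
text recognition, so either part can be replaced without touching the other.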


• Key Information Extraction


Key information extraction (KIE) is an important task in Document VQA. It refers to extracting the needed
information from images, such as the name and ID number from an ID card. The set of fields is fixed within one
task but differs between tasks.


Q1: What's the address of the house? 


A1: Room XXX, Building No.X, XX District, 
Beijing, China 


Q2: What is the area of the house?


A2: 90.69 square meters


Figure 15: Doc VQA tasks 


KIE is usually divided into two sub-tasks for research: 


• SER: Semantic entity recognition, which classifies each detected text. For example, it divides texts into
name and ID number, like the black and red boxes in the figure below.


• RE: Relation extraction, which classifies each detected text, for example as a question or an answer, and then
finds the corresponding answer for each question. As shown in the figure below, the red and black boxes represent
questions and answers respectively, and the yellow arrows show the correspondence between questions and answers.


Figure 16: SER and RE tasks 


KIE methods are generally developed from named entity recognition (NER) [4], but such methods use only the text
information in the image and ignore visual and structural information, so their accuracy is limited. For this
reason, many recent solutions fuse visual and structural information with text information. According to the
principle used to fuse the multi-modal information, these methods can be divided into four types:


• Grid-based method
• Token-based method
• GCN-based method
• End-to-end method


The technologies related to document analysis will be demonstrated and practiced in Chapter 6.


3.2.4 Other Technologies 


The three key technologies introduced above are text detection, text recognition, and document structured
recognition. Other cutting-edge technologies related to OCR include end-to-end text recognition, image
preprocessing, and OCR data synthesis. For more details, please refer to Chapter 7 and Chapter 8.


3.3 Industrial Practice of OCR Technology


[Cartoon: “Wong, please develop an invoice identification APP in one week. If you can't finish it, you'll deduct
an ...”]

If you were Xiao Wang, what would you do? 
1. I know nothing about this. I'd give it up. 


2. I'd recommend the boss to find an outsourcing company or propose a commercial project. Anyway, it 
is the boss’s bill. 


3. I'd find similar projects online, and perform GitHub-oriented programming.


OCR technology aims to be applied in industrial practice. Although there is a lot of academic research on OCR technology 
and its commercial application is more mature than other AI technologies, there are still some difficulties in industrial 
application. The following section will analyze the difficulties in technology and industrial practice. 


3.3.1 Difficulties in Industrial Practice 


In industrial practice, developers often need to rely on open-source community resources to start or promote projects. But 
they often encounter three problems in the use of open-source models: 


Figure 17: Three major problems in the industrial practice of OCR technology 


1. Hard to find & hard to select 


The open-source community has rich resources, but information asymmetry keeps developers from solving their pain
points efficiently. On the one hand, the resources are so abundant that, given a requirement, developers cannot
quickly find a matching project among the huge number of code repositories. This is the “hard to find” problem. On
the other hand, when selecting algorithms, metrics reported on English public datasets offer no direct reference
for the Chinese scenarios developers often face, so they have to verify algorithms one by one. This is
time-consuming and laborious, and still cannot guarantee that the selected algorithm is the most suitable. This is
the “hard to select” problem.


2. Not applicable to industrial scenarios 


Work in the open-source community, such as open-sourcing or reproducing the code of academic papers, focuses more
on accuracy than on model size and speed. In industrial practice, however, these two indicators are as important as
accuracy and cannot be ignored. Whether on the mobile side or the server side, the number of images to be
recognized is so large that the model needs to be smaller, more accurate, and faster at inference. GPUs are
expensive, so running on CPUs is more economical. Provided business demands are met, the lighter the model is, the
fewer resources it consumes.


3. Hard to optimize & problematic to train and deploy 


Open-source algorithms or models alone cannot meet business needs directly. In actual business scenarios, OCR
faces a wide variety of problems, and the particular needs of each business scenario often require retraining on
custom datasets. With current open-source projects, experimenting with optimization solutions is expensive. In addition,
OCR is applied in many scenarios, with broad demand on both servers and mobile devices, so diverse hardware
environments call for a variety of deployment methods. But open-source community projects focus more on algorithms
and models, and lack support for inference and deployment. Applying the algorithms from papers as working OCR
technology therefore places high demands on developers’ algorithmic and engineering abilities.


3.3.2 Industry-level OCR Development Kit —PaddleOCR 


OCR industry practice demands a full-process solution that speeds up research and development and saves time.
In particular, ultra-lightweight models and their full-process solutions are a hard requirement for
mobile terminals and embedded devices with limited computing power and storage.


So, the industry-level OCR development kit PaddleOCR has come into being. 


PaddleOCR is built around user profiles and needs. With PaddlePaddle as its core framework, it selects and reproduces
diverse cutting-edge algorithms, develops PP models better suited to industrial use on the basis of the
reproduced algorithms, and integrates training and inference to provide various inference and deployment methods
that meet the demands of different application scenarios.




[Figure 18 diagram, top to bottom: application scenarios (financial, license plates, office documents, medical bills,
educational scenes); deployment (Paddle Inference for servers, Paddle Serving for online services, Paddle Lite for mobile
devices, PaddleSlim for model compression); pretrained models; solutions (PP-OCR, PP-Structure); algorithms (detection:
EAST, DB, SAST, PSENet; recognition: CRNN, RARE, STAR-Net; end-to-end: PGNet, ABCNet; document analysis: layout analysis,
table recognition, key information extraction, semantic entity recognition); data tools (semi-automatic data annotation
tool, Style-Text data synthesis tool); core framework: PaddlePaddle]


Figure 18: Panorama of the PaddleOCR development kit


The panorama shows that, built on the PaddlePaddle core framework, PaddleOCR provides abundant solutions in model
algorithms, pretrained model libraries, and industry-level deployment, and also provides tools for data
synthesis and semi-automatic data annotation to support developers’ data production.


As for model algorithms, PaddleOCR provides solutions for text detection and recognition and for document structure
analysis. In terms of text detection and recognition, PaddleOCR has reproduced or open-sourced four text
detection algorithms, eight text recognition algorithms, and one end-to-end text recognition algorithm. On this basis,
it has developed the PP-OCR series, a universal text detection and recognition solution. In terms of document
structure analysis, PaddleOCR provides algorithms such as layout analysis, table recognition, key information extraction,
and named entity recognition, and on this basis has proposed the PP-Structure document analysis solution. This variety of
selected algorithms can meet the needs of developers in different business scenarios. The unified code framework
also makes it easier for them to optimize different algorithms and compare their performance.


As for pretrained model libraries, based on the PP-OCR and PP-Structure solutions, PaddleOCR has developed and
open-sourced PP series models suited to industrial practice, including universal, ultra-lightweight and multilingual text
detection and recognition models, and complex-document analysis models. The PP series models are thoroughly optimized
on top of the original algorithms so that their accuracy and performance meet the standards of industrial practice.
Developers can easily obtain a “practical model” for their own business needs either by applying the models directly to
business scenarios or by fine-tuning them on business data.


As for industry-level deployment, PaddleOCR provides a server-side inference solution based on Paddle Inference, a
service-based deployment solution based on Paddle Serving, and an end-side deployment solution based on Paddle Lite
to meet deployment needs in different hardware environments. It also provides a model compression scheme
based on PaddleSlim, which can further reduce the model size. These deployment methods connect the
whole process from training to inference, so that developers can deploy efficiently, stably, and reliably.


As for data tools, PaddleOCR provides a semi-automatic data annotation tool, PPOCRLabel, and a data synthesis
tool, Style-Text, to help developers produce the datasets and annotations required for model training more
conveniently. PPOCRLabel, the first open-source semi-automatic OCR data annotation tool in the industry, targets
the tedious and mechanical labeling process, the massive demand for manually labeled training
data, and the high cost in time and money. It uses a built-in PP-OCR model to implement a workflow of pre-labeling +
manual verification, which can greatly improve labeling efficiency and save labor costs. The data synthesis tool
Style-Text addresses the serious shortage of real data in actual scenes and the inability of traditional synthesis
algorithms to reproduce text styles (fonts, colors, spacing, background). With only a few images of the target scene, it
can synthesize a large number of text images similar in style to that scene.
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Figure 19: Schematic of using PPOCRLabel 
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Figure 20: Examples of Style-Text synthesis results 
PP-OCR and PP-Structure


The PP series models are models that PaddlePaddle’s vision development kits have thoroughly optimized to meet the needs
of industrial practice, aiming to strike a balance between speed and accuracy. The PP series models in PaddleOCR include
the PP-OCR series for text detection and recognition and the PP-Structure series for document analysis.


(1) Chinese and English model of PP-OCR 
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Figure 21: Examples of Chinese and English model of PP-OCR on recognition results 


The Chinese and English models of PP-OCR adopt a typical two-stage OCR algorithm, following the paradigm of
detection model + recognition model. The concrete algorithm framework is as follows:
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Figure 22: Schematic of the PP-OCR system pipeline 


It can be seen that in addition to input and output, the core framework of PP-OCR contains three modules: text detection, 
detection frame correction, and text recognition. 


• Text detection module: Its core is a text detection model trained with the DB detection algorithm, which detects the
text areas in the image.

• Detection frame correction module: The detected text boxes are fed into this module, which rectifies each irregular
text box into a rectangular frame in preparation for text recognition. It also judges and corrects the text
direction. For example, a text line judged to be upside down is flipped. This function relies on a trained
text direction classifier.

• Text recognition module: Finally, this module performs text recognition on each corrected box to obtain
the text content. The classic text recognition algorithm used in PP-OCR is CRNN.
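
The way the three modules chain together can be sketched in a few lines of Python. This is only an illustration of the data flow, not PaddleOCR's implementation: `run_ocr_pipeline` and the three `fake_*` stand-ins are hypothetical names introduced here, and the toy "image" is just a list of characters.

```python
from typing import Callable, List, Sequence, Tuple

Point = Tuple[float, float]
Box = List[Point]  # four corner points of one detected text region


def run_ocr_pipeline(
    image: object,
    detect: Callable[[object], Sequence[Box]],
    rectify: Callable[[object, Box], object],
    recognize: Callable[[object], str],
) -> List[Tuple[Box, str]]:
    """Chain the three modules: detection -> box correction -> recognition."""
    results = []
    for box in detect(image):        # 1. text detection (DB in PP-OCR)
        crop = rectify(image, box)   # 2. straighten the box, fix the direction
        text = recognize(crop)       # 3. text recognition (CRNN in PP-OCR)
        results.append((box, text))
    return results


# Toy stand-ins (hypothetical) so that the sketch runs without any model.
def fake_detect(image):
    return [[(0, 0), (4, 0), (4, 1), (0, 1)]]


def fake_rectify(image, box):
    return image[::-1]  # pretend the text line was reversed and flip it back


def fake_recognize(crop):
    return "".join(crop)


print(run_ocr_pipeline(list("RCO"), fake_detect, fake_rectify, fake_recognize))
```

Passing the modules in as callables keeps the sketch independent of any particular detector or recognizer, which mirrors how each PP-OCR module can be trained and replaced separately.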


PaddleOCR has introduced PP-OCR[23] and PP-OCRv2[24] models. 


The PP-OCR model has a mobile version (lightweight) and a server version (universal). The mobile version
is optimized mainly on top of the lightweight backbone network MobileNetV3. The optimized system (detection
model + text direction classification model + recognition model) is only 8.1M in size, takes 350ms to predict a single image
on a CPU, and takes about 110ms on a T4 GPU. After pruning and quantization, its size can be further compressed to
3.5M with the same accuracy, which is convenient for end-side deployment. In the inference test on a Snapdragon 855
processor, the model takes only 260ms. For more evaluation data on PP-OCR, please refer to the benchmark.


PP-OCRv2 keeps the overall framework of PP-OCR and further optimizes its strategies. The improvement shows mainly in 3
aspects:

• Its accuracy is more than 7% higher than that of the PP-OCR mobile version;

• Its speed increases by 220% compared with the PP-OCR server version;

• Its model size of 11.6M makes it easy to deploy on both servers and mobile terminals.

The optimization policies of PP-OCR and PP-OCRv2 will be detailed in Chapter 4.


In addition to the Chinese and English models, PaddleOCR has also trained and open-sourced an English and digit model
and multilingual recognition models on different datasets. All of these are ultra-lightweight and suitable for
different language scenarios.
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Figure 23: Schematic of the recognition result of the English digital model and multilingual model of PP-OCR 


(2) PP-Structure document analysis model 
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PP-Structure supports three subtasks: layout analysis, table recognition, and DocVQA. 


PP-Structure has six core functions:

• Performing layout analysis on documents in image form, dividing them into 5 types of areas: texts,
titles, tables, pictures and lists (used together with Layout-Parser)

• Extracting texts, titles, pictures and lists as text fields (used together with PP-OCR)

• Performing structured analysis of tables and outputting the final result as an Excel file

• Supporting the Python whl package and the command line, simple and user-friendly

• Supporting custom training for two types of tasks: layout analysis and table structuring

• Supporting the VQA tasks SER and RE


Figure 24: Schematic of the PP-Structure system (this figure only contains layout analysis + table recognition)


The PP-Structure solution will be explained in detail in Chapter 6.


Industrial-level Deployment Plan 


PaddlePaddle supports full-process, full-scenario inference and deployment. The models come from three main
sources. The first is training a network structure built with the PaddlePaddle APIs. The second is the series
of PaddlePaddle toolkits, which provide multiple model libraries and concise, easy-to-use APIs that work out of
the box, including the vision model library PaddleCV, the intelligent speech library PaddleSpeech and the natural language
processing library PaddleNLP. The third is models from third-party frameworks (PyTorch, ONNX,
TensorFlow, etc.) converted with the X2Paddle tool.


PaddlePaddle models can be compressed, quantized, and distilled with the PaddleSlim tool. They support five
deployment schemes: Paddle Serving (service-based), Paddle Inference (server-side or cloud-side), Paddle Lite (mobile
or edge), Paddle.js (web front end), and, for hardware not covered by these, such as MCUs, Horizon, Kunyun and other
domestic chips, conversion with Paddle2ONNX to third-party frameworks that support ONNX.




[Figure 25 diagram: model preparation (PaddlePaddle development and training; PaddleCV / PaddleSpeech / PaddleNLP / ...;
PyTorch / ONNX / TensorFlow models converted with X2Paddle), then model optimization (PaddleSlim: compression,
quantization, distillation), then deployment (Paddle Serving; Paddle Inference on server or cloud; Paddle Lite on mobile
devices; Paddle.js on the web; third-party frameworks supporting ONNX via Paddle2ONNX)]


Figure 25: Deployment methods available on PaddlePaddle 


Paddle Inference supports server-side and cloud deployment with high performance and versatility. It has been thoroughly
adapted and optimized for different platforms and application scenarios. As the native inference
library of PaddlePaddle, it ensures that models can be used on the server side as soon as they are trained and can be
deployed quickly. It is suited to deploying algorithmically complex models on high-performance hardware from applications
written in multiple programming languages. The supported hardware includes x86 CPUs, Nvidia GPUs, and AI accelerators
such as Baidu Kunlun XPU and Huawei Ascend.


Paddle Lite is a lightweight, high-performance end-side inference engine. It has been deeply adapted and
optimized for end-side devices and application scenarios. It currently supports multiple platforms such as
Android, iOS, embedded Linux devices, and macOS. The supported hardware covers ARM CPUs and GPUs, x86 CPUs and newer
hardware such as Baidu Kunlun, Huawei Ascend and Kirin, and Rockchip.


Paddle Serving is a high-performance serving framework designed to help users deploy models as cloud services
in a few steps. At present, Paddle Serving supports custom pre-processing, model combination, hot reloading of updated
models, multi-machine multi-card multi-model deployment, distributed inference, K8S deployment, security gateways and
model encryption, and access from multiple languages and clients. Paddle Serving also officially provides deployment
examples for more than 40 models, including PaddleOCR, to help users get started faster.
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Figure 26: Supported deployment mode of PaddlePaddle 


The above deployment plans will be detailed and practiced based on the PP-OCRv2 model in Chapter 5. 


3.4 Summary 


This section first introduces the application scenarios and cutting-edge algorithms of OCR technology, and then analyzes 
the difficulties and three major challenges of OCR technology in its industrial practice. 


The following chapters are organized as follows:
• Chapter 2 and Chapter 3: text detection and recognition and their practice;
• Chapter 4: PP-OCR optimization policies;
• Chapter 5: Practice of inference and deployment;
• Chapter 6: Document structuring;
• Chapter 7: Other OCR-related algorithms such as end-to-end algorithms, data preprocessing, and data synthesis;
• Chapter 8: OCR-related datasets and data synthesis tools.
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CHAPTER 
FOUR 


TEXT DETECTION 


Text detection is the task of locating text in an image or video. It differs from object detection, which must
solve not only the localization problem but also the problem of target classification.


Text in an image can be regarded as a kind of ‘object’, so object detection methods also apply to text
detection. The two tasks compare as follows:

• Object detection: given an image or video, find the bounding box of each target and classify it;

• Text detection: given an image or video, find the text areas, each of which can be a single character or a whole text
line.


Figure 1: Schematic of object detection
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Figure 2: Schematic diagram of text detection 
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Object detection and text detection both concern “location”. But the latter one does not need to classify the target, and 
the shape of the text is complex and diverse. 
Currently, text detection usually refers to scene text detection, which is difficult for the following reasons:


1. Diversity of texts in natural scenes: text detection may be affected by text color, size, font, shape, direction,
language, and length.

2. Complexity of backgrounds and interference: text detection may be affected by image distortion, blurring, low
resolution, shadow, brightness and other factors.

3. Dense or overlapped texts.

4. Local consistency of texts: a small part of a text line can also be regarded as an independent text.




Figure 3: Text detection scenes 


Many text detection algorithms based on deep learning have emerged to solve the problems of text detection in natural 
scenes. These methods can be divided into regression-based and segmentation-based text detection methods. 


The next section will briefly introduce the classic text detection algorithms based on deep learning technology. 


4.1 Introduction to Text Detection Methods 


In recent years, a growing number of deep learning-based text detection algorithms have appeared. These methods can be
roughly classified into two categories:


1. Regression-based methods 
2. Segmentation-based methods 


The methods commonly used from 2017 to 2021 are selected here and classified into the above two
types, as shown below:
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Figure 4: Text detection algorithms 


4.1.1 Regression-based Text Detection 


Regression-based algorithms are similar to object detection algorithms, except that text detection has only two
classes: the text in an image is the target to be detected, and everything else is background.


Horizontal Text Detection Algorithms 


Early deep learning-based methods were modified object detection algorithms that supported only horizontal text
detection. For example, TextBoxes is adapted from SSD, and CTPN from the two-stage object detection algorithm
Faster R-CNN.


The TextBoxes[1] algorithm is adapted from the one-stage object detector SSD. The default boxes are changed
to quadrilaterals that fit the direction and aspect ratio of text, and the algorithm provides an end-to-end trainable
text detection method without complicated post-processing. Its main changes are:


• The pre-selection boxes have larger aspect ratios.
• The convolution kernel is changed from 3x3 to 1x5, which is more suitable for detecting long texts.
• Multi-scale input is adopted.


Based on the Faster R-CNN algorithm, CTPN[3] extends the RPN module and designs a CRNN-based module that enables the
network to detect text sequences from convolutional features. As a two-stage method, it can locate features more accurately
through ROI Pooling. However, TextBoxes and CTPN can only detect horizontal text.
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Figure 6: Frame diagram of CTPN 


Detection of Texts at Any Angle 


TextBoxes++[2] is modified from TextBoxes to detect text at any angle. Structurally, it differs from TextBoxes
in three ways. First, it adjusts the aspect ratios of the pre-selection boxes
to 1, 2, 3, 5, 1/2, 1/3, and 1/5. Second, it changes the 1x5 convolution kernel to 3x5 to better learn
the features of tilted text. Finally, it outputs the representation of a rotated box.
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Figure 7: Frame diagram of TextBoxes++ 
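
To make the role of the aspect ratios concrete, the sketch below generates one pre-selection box per ratio around a single feature-map cell. It assumes the common SSD-style parametrization in which every box keeps the same area (`w = scale * sqrt(r)`, `h = scale / sqrt(r)`); the function name `default_boxes` and the `scale` value are illustrative assumptions, not taken from the TextBoxes++ code.

```python
import math

# Aspect ratios (width / height) listed in the text for TextBoxes++.
ASPECT_RATIOS = [1, 2, 3, 5, 1 / 2, 1 / 3, 1 / 5]


def default_boxes(cx, cy, scale, ratios=ASPECT_RATIOS):
    """Return (cx, cy, w, h) boxes of equal area scale**2 centered on one cell."""
    boxes = []
    for r in ratios:
        w = scale * math.sqrt(r)  # wider than tall when r > 1
        h = scale / math.sqrt(r)  # taller than wide when r < 1
        boxes.append((cx, cy, w, h))
    return boxes


for cx, cy, w, h in default_boxes(cx=16.0, cy=16.0, scale=8.0):
    print(f"ratio={w / h:.2f}  w={w:.2f}  h={h:.2f}")
```

Ratios above 1 produce the long horizontal boxes that suit text lines, while the reciprocal ratios cover vertical and strongly tilted text.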


EAST [4] locates tilted text with a two-stage method consisting of FCN feature extraction and
NMS. It proposes a new text detection pipeline that can be trained end-to-end and detects text
in any orientation, with a simple structure and excellent performance. The FCN can output either rotated
rectangles or four-point boxes, and the user chooses the output format:


• If the output shape is RBox, the network outputs the rotation angle of the box and its AABB text shape, where AABB
represents the offsets to the top, bottom, left, and right sides of the box. RBox can describe rotated rectangular text.

• If the output is a four-point box, the last dimension of the output has 8 numbers, which
are the coordinate offsets to the four vertices of the quadrilateral. This output can predict text in the shape
of irregular quadrilaterals.
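
As an illustration of the RBox format, the following sketch turns one pixel's prediction (four side distances plus an angle) into the four corner points of the rotated box. It is a geometric sketch under stated assumptions, not EAST's decoding code: the function name `decode_rbox`, the argument order, and the sign convention of the angle are all assumptions made here.

```python
import math


def decode_rbox(x, y, top, right, bottom, left, angle):
    """Turn one pixel's RBox prediction into four corner points.

    (x, y) is the pixel position; top/right/bottom/left are its distances to
    the four sides of the box; angle is the box rotation in radians.
    Image coordinates are assumed (y grows downward).
    """
    # Corners relative to the pixel in the box's own unrotated frame,
    # listed clockwise from the upper left.
    local = [(-left, -top), (right, -top), (right, bottom), (-left, bottom)]
    cos_a, sin_a = math.cos(angle), math.sin(angle)
    corners = []
    for dx, dy in local:
        # Rotate around the pixel, then shift back to image coordinates.
        corners.append((x + dx * cos_a - dy * sin_a,
                        y + dx * sin_a + dy * cos_a))
    return corners


print(decode_rbox(10, 10, top=2, right=6, bottom=2, left=4, angle=0.0))
print(decode_rbox(10, 10, top=2, right=6, bottom=2, left=4, angle=math.pi / 6))
```

With an angle of zero the result is simply the axis-aligned box; a non-zero angle rotates the same rectangle around the pixel, so side lengths are preserved.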


The text boxes output by the FCN are redundant: boxes generated by adjacent pixels within one text area overlap
heavily, yet they all describe the same text. EAST therefore first merges the predicted
boxes row by row, and then filters the remaining quadrilaterals with standard NMS.
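
The merge-then-filter idea can be sketched as follows. This is a simplified illustration, not EAST's implementation: it uses axis-aligned boxes `(x1, y1, x2, y2, score)` instead of quadrilaterals, and the function names and the threshold value of 0.3 are assumptions made for the example.

```python
def iou(a, b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2, ...)."""
    iw = min(a[2], b[2]) - max(a[0], b[0])
    ih = min(a[3], b[3]) - max(a[1], b[1])
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def weighted_merge(a, b):
    """Average the coordinates of two boxes, weighted by score; scores add up."""
    sa, sb = a[4], b[4]
    coords = [(a[i] * sa + b[i] * sb) / (sa + sb) for i in range(4)]
    return (*coords, sa + sb)


def standard_nms(boxes, thresh):
    """Greedy NMS: keep the best box, drop boxes that overlap a kept one too much."""
    kept = []
    for box in sorted(boxes, key=lambda b: b[4], reverse=True):
        if all(iou(box, k) <= thresh for k in kept):
            kept.append(box)
    return kept


def merge_then_nms(boxes, thresh=0.3):
    """Merge consecutive overlapping boxes (given row by row), then run NMS."""
    merged, last = [], None
    for box in boxes:
        if last is not None and iou(box, last) > thresh:
            last = weighted_merge(box, last)
        else:
            if last is not None:
                merged.append(last)
            last = box
    if last is not None:
        merged.append(last)
    return standard_nms(merged, thresh)


candidates = [
    (0, 0, 10, 4, 0.9), (1, 0, 11, 4, 0.8), (0.5, 0, 10.5, 4, 0.7),  # same text
    (30, 0, 40, 4, 0.95),                                            # another text
]
for box in merge_then_nms(candidates):
    print([round(v, 2) for v in box])
```

Merging neighbors first means the expensive pairwise NMS only sees a handful of boxes per text instance instead of one box per pixel.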


Figure 8: Frame diagram of EAST 


MOST [15] proposes the TFAM module, which dynamically adjusts the receptive field according to coarse-grained
detection results, and PA-NMS, which merges reliable detection results based on position. In
addition, it proposes the Instance-wise IoU loss function, used during training to balance
text instances of different scales. The method can be combined with EAST, and it performs well
in detecting text with extreme aspect ratios and different scales.
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Figure 9: MOST frame diagram 


Arbitrary Shape Scene Text Detection 


Among the ideas for detecting curved text by regression, a simple one is to describe the boundary polygon of the curved
text with multi-point coordinates, and then directly predict the vertex coordinates of the polygon.


CTD [6] proposes predicting the polygon of curved text with 14 vertices. The network uses a Bi-LSTM [13]
layer to refine the predicted vertex coordinates, achieving regression-based curved text detection.


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Figure 10: Frame diagram of CTD 


For long and curved text, LOMO [19] proposes iteratively optimizing the text localization features to obtain more
accurate text locations. LOMO consists of a direct regressor (DR), an iterative
refinement module (IRM), and a shape expression module (SEM). The DR first generates text areas, the IRM then iteratively
refines the text localization features, and finally the SEM predicts the text region, the text center line and the border
offsets. Iterative optimization of text features handles long text localization better and locates text regions more
accurately.
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Figure 11: Frame diagram of LOMO 


ContourNet [18] models text contour points to obtain the detection box of curved text. First, an Adaptive-RPN
generates text region proposals. Then, a Local Orthogonal Texture-aware Module (LOTM) learns
horizontal and vertical texture features and represents them as contour points. Finally, taking into account the feature
responses in the two orthogonal directions, a Point Re-Scoring algorithm filters out predictions with strong
unidirectional or weakly orthogonal activation, so that the text contour is represented by a group of high-quality contour
points.
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Figure 12: Frame diagram of Contournet 


PCR [14] proposes detecting curved text with progressive coordinate regression, in three stages. First,
it detects a rough text region and obtains a text box. Second, a Contour Localization Mechanism (CLM) predicts the
corners of the smallest bounding box of the text. Finally, several CLM modules and RCLM modules are stacked to predict the
curved text. This progressive contour regression yields more refined text representations, which both protects
the coordinate regression from redundant noise and locates the text region more accurately.


[Figure 13 diagram labels: Input Image; Box Contour Sample Points (N×2); Contour Scoring Mechanism (CSM) with outputs
Contour Locations and Text or Non-text; New Contour Points (N×2)]


Figure 13: Frame diagram of PCR 


4.1.2 Segmentation-based Text Detection 


Although regression-based methods work well for text detection, they often struggle to outline curved text
with a smooth curve, and their models are relatively complex without a performance advantage. Therefore, researchers have
proposed text detection methods based on image segmentation. These first classify each pixel, determining whether
it belongs to a text target, to obtain a probability map of the text region, and then derive the curve enclosing the
segmented text region through post-processing.
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Figure 14: Schematic of text segmentation algorithm 


Because they work at the pixel level, segmentation-based methods have a natural
advantage in detecting text of irregular shape. Their main idea is
to obtain the text area of the image by segmentation, and then apply post-processing such as OpenCV or polygon
operations to obtain the minimum enclosing curve of the text area.
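
The two steps described above, per-pixel classification followed by post-processing, can be sketched without any library. The example below is a deliberately simplified stand-in for the OpenCV-style post-processing mentioned in the text: it thresholds a toy probability map, groups text pixels into 4-connected regions, and returns an axis-aligned bounding box per region rather than a tight enclosing curve. The function name and the 0.5 threshold are assumptions.

```python
from collections import deque


def boxes_from_probability_map(prob, thresh=0.5):
    """Binarize a probability map and return one bounding box per text blob.

    prob is a list of rows with values in [0, 1]; each returned box is
    (x_min, y_min, x_max, y_max) in pixel indices, inclusive.
    """
    h, w = len(prob), len(prob[0])
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for y in range(h):
        for x in range(w):
            if prob[y][x] < thresh or seen[y][x]:
                continue
            # Breadth-first search over the 4-connected text pixels.
            queue = deque([(x, y)])
            seen[y][x] = True
            x0 = x1 = x
            y0 = y1 = y
            while queue:
                cx, cy = queue.popleft()
                x0, x1 = min(x0, cx), max(x1, cx)
                y0, y1 = min(y0, cy), max(y1, cy)
                for nx, ny in ((cx + 1, cy), (cx - 1, cy), (cx, cy + 1), (cx, cy - 1)):
                    inside = 0 <= nx < w and 0 <= ny < h
                    if inside and not seen[ny][nx] and prob[ny][nx] >= thresh:
                        seen[ny][nx] = True
                        queue.append((nx, ny))
            boxes.append((x0, y0, x1, y1))
    return boxes


prob_map = [
    [0.9, 0.8, 0.1, 0.0, 0.0],
    [0.7, 0.9, 0.2, 0.0, 0.6],
    [0.0, 0.1, 0.0, 0.0, 0.8],
]
print(boxes_from_probability_map(prob_map))
```

The sketch also shows where the “adhesion” problem comes from: two texts whose pixels touch after thresholding end up in the same connected region and therefore in the same box.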


PixelLink [7] detects text by segmentation, where the segmented object is a text region. Pixels of the same text
line (word) are linked together to segment the text, and the text box is extracted directly from the segmentation without
position regression, with detection results as good as those of regression-based methods. However, segmentation-based
methods have a problem: for texts that are close together, the segmented regions easily “adhere” to one another. Wu,
Yue et al. [8] propose separating texts by also learning the positions of text boundaries to better distinguish text regions.
In addition, Tian et al. [9] propose mapping the pixels to an embedding space in which the
vectors of pixels from the same text are close together, and those from different texts are far apart.
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Figure 15: Frame diagram of PixelLink 
For the multi-scale issue in text detection, MSR [20] extracts multi-scale features from the same image,
then merges these features and upsamples them to the original image size. The network then predicts the
text center area and, for each point in the center area, the x and y offsets to the nearest boundary point. From these, the
coordinate set of the text region contour can be obtained.
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Figure 16: Frame diagram of MSR 


Because segmentation-based algorithms have difficulty distinguishing adjacent texts, PSENet [10] proposes
a progressive scale expansion network that learns text segmentation regions: it predicts text regions at different shrinkage
ratios and expands the detected regions step by step. In essence, the method is a variant of boundary learning,
and it can effectively detect adjacent texts of any shape.




Figure 17: Frame diagram of PSENet 


Assume that PSENet post-processing uses three kernels of different scales, shown as s1, s2, and
s3 in the figure above. First, the connected components of the smallest kernel s1 are computed, giving (b). Each
component is then expanded outward, and the pixels that belong to s2 but not to s1 are assigned to a component. When two
components compete for a pixel, the conflict is resolved on a “first come, first served” basis. The expansion is then
repeated for s3, and finally all text lines are segmented into individual regions.
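
The expansion procedure can be sketched with a breadth-first search. The code below is a minimal illustration of the idea, not the PSENet implementation: kernels are plain nested lists of 0/1 values ordered from smallest to largest, connectivity is 4-neighbor, and all function names are assumptions.

```python
from collections import deque

NEIGHBORS = ((1, 0), (-1, 0), (0, 1), (0, -1))


def label_components(kernel):
    """Give each 4-connected blob of 1s in the smallest kernel its own label."""
    h, w = len(kernel), len(kernel[0])
    labels = [[0] * w for _ in range(h)]
    next_label = 0
    for y in range(h):
        for x in range(w):
            if kernel[y][x] and not labels[y][x]:
                next_label += 1
                labels[y][x] = next_label
                queue = deque([(x, y)])
                while queue:
                    cx, cy = queue.popleft()
                    for dx, dy in NEIGHBORS:
                        nx, ny = cx + dx, cy + dy
                        if 0 <= nx < w and 0 <= ny < h and kernel[ny][nx] and not labels[ny][nx]:
                            labels[ny][nx] = next_label
                            queue.append((nx, ny))
    return labels


def expand(labels, kernel):
    """Grow existing labels into a larger kernel; the first label to arrive wins."""
    h, w = len(kernel), len(kernel[0])
    out = [row[:] for row in labels]
    queue = deque((x, y) for y in range(h) for x in range(w) if out[y][x])
    while queue:
        cx, cy = queue.popleft()
        for dx, dy in NEIGHBORS:
            nx, ny = cx + dx, cy + dy
            if 0 <= nx < w and 0 <= ny < h and kernel[ny][nx] and not out[ny][nx]:
                out[ny][nx] = out[cy][cx]  # first come, first served
                queue.append((nx, ny))
    return out


def progressive_scale_expansion(kernels):
    """kernels are binary masks ordered from the smallest (s1) to the largest."""
    labels = label_components(kernels[0])
    for kernel in kernels[1:]:
        labels = expand(labels, kernel)
    return labels


s1 = [[0, 1, 0, 0, 0, 1, 0]]
s2 = [[1, 1, 1, 0, 1, 1, 1]]
s3 = [[1, 1, 1, 1, 1, 1, 1]]
print(progressive_scale_expansion([s1, s2, s3]))
```

In the toy example the largest kernel s3 is a single connected strip, so thresholding it alone would merge the two texts; starting from the well-separated s1 components keeps them apart.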


For curved and dense text, SegLink++ [17] proposes characterizing the attraction and repulsion between text
segments, uses a minimum spanning tree algorithm to combine segments into text detection boxes, and introduces an
instance-aware loss function that makes the method trainable end-to-end.
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Figure 18: Frame diagram of Seglink++ 
Although segmentation-based methods can detect curved text, their complex post-processing logic and prediction speed
still need to be optimized.


PAN [11] speeds up text detection by improving both the network design and the post-processing.
First, PAN uses the lightweight ResNet18 as the backbone, and designs a
lightweight feature enhancement module, FPEM, and a feature fusion module, FFM, to strengthen the features extracted by the
backbone. In post-processing, PAN adopts pixel clustering: around each predicted text center (kernel), it merges the
pixels whose distance from the kernel is less than a threshold d. This design achieves high accuracy together with fast
prediction speed.
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Figure 19: Frame diagram of PAN 
DBNet [12] targets the time-consuming post-processing of segmentation-based methods, which need a
threshold for binarization. It proposes a learnable threshold and a binarization function that approximates the step
function, so that the segmentation network learns the segmentation threshold end-to-end during training. The automatic
adjustment of the threshold not only improves accuracy, but also simplifies post-processing and improves the performance of
text detection.
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Figure 20: Frame diagram of DB 
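
The step-like binarization function can be written out explicitly. In the DB paper it takes the form B = 1 / (1 + exp(-k (P - T))), where P is the probability map value, T the learned threshold at the same pixel, and k an amplifying factor (set to 50 in the paper). The scalar sketch below compares it with hard thresholding; the function names are illustrative.

```python
import math


def hard_binarize(p, t):
    """Standard binarization: a step function, not differentiable at p == t."""
    return 1.0 if p >= t else 0.0


def differentiable_binarize(p, t, k=50.0):
    """Smooth approximation of the step function: 1 / (1 + exp(-k * (p - t)))."""
    return 1.0 / (1.0 + math.exp(-k * (p - t)))


for p, t in [(0.9, 0.3), (0.35, 0.3), (0.3, 0.3), (0.1, 0.3)]:
    print(p, t, hard_binarize(p, t), round(differentiable_binarize(p, t), 4))
```

Far from the threshold the two functions agree almost exactly, while near the threshold the smooth version has a usable gradient, which is what allows the threshold to be learned together with the segmentation network.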


FCENet [16] represents the curve enclosing the text with Fourier transform parameters. Since Fourier coefficients can
theoretically fit any closed curve, FCENet designs a model that predicts the Fourier-based representation of text
contours of arbitrary shape. In this way, it improves the detection accuracy for highly curved text in natural scenes.
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Figure 21: Frame diagram of FCENet 
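
How a handful of Fourier coefficients describe a closed contour can be shown with a short sketch. It samples the curve f(t) = Σ c_k · exp(2πikt) for t in [0, 1), reading each complex value x + iy as a point (x, y). The function name, the dictionary layout of the coefficients, and the toy coefficients themselves are assumptions for illustration; this is the reconstruction step only, not the FCENet network.

```python
import cmath
import math


def contour_from_fourier(coeffs, num_points=8):
    """Sample a closed contour from Fourier coefficients.

    coeffs maps an integer frequency k to a complex coefficient c_k.
    """
    points = []
    for n in range(num_points):
        t = n / num_points
        z = sum(c * cmath.exp(2j * math.pi * k * t) for k, c in coeffs.items())
        points.append((z.real, z.imag))
    return points


# c_0 places the center at (10, 5); c_1 and c_-1 together give an ellipse
# with semi-axes 4 + 2 = 6 (horizontal) and 4 - 2 = 2 (vertical).
ellipse = contour_from_fourier({0: 10 + 5j, 1: 4 + 0j, -1: 2 + 0j})
for x, y in ellipse:
    print(f"({x:.2f}, {y:.2f})")
```

Adding higher-frequency coefficients lets the same formula trace more irregular closed shapes, which is why a fixed-length coefficient vector can stand in for a curved text boundary.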


4.2 Practice of a Text Detection Algorithm DBNet 


This section introduces how to use PaddleOCR to train and run the DB text detection algorithm,
including:


1. Quickly calling PaddleOCR package to try text detection 
2. Understanding the principle of DB algorithm 

3. Learning procedures of building the text detection model 
4. Learning the training of the text detection model 


Note: paddleocr refers to PaddleOCR whl package. 
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4.2.1 Quick Start 


This section will take paddleocr as an example to introduce how to quickly implement text detection in three steps. 
1. Install PaddleOCR whl package 
2. Issue a command to run the DB algorithm to get the detection result 
3. Visualize text detection result 


Install PaddleOCR whl package 


!pip install --upgrade pip
!pip install paddleocr


Issue a command to implement text detection


On first use, paddleocr automatically downloads and runs the PP-OCRv2 lightweight model from
PaddleOCR’s GitHub repository.


Running the installed paddleocr on the image ./12.jpg gives the following result:
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Figure 1: /12.jpg 


[[79.0, 555.0], [398.0, 542.0], [399.0, 571.0], [80.0, 584.0]]
[[21.0, 507.0], [512.0, 491.0], [513.0, 532.0], [22.0, 548.0]]
[[174.0, 458.0], [397.0, 449.0], [398.0, 480.0], [175.0, 489.0]]
[[42.0, 414.0], [482.0, 392.0], [484.0, 428.0], [44.0, 450.0]]


The inference result contains four text boxes. Each box is given by the coordinates of its four corner points, which are
arranged in clockwise order starting from the upper left.
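

To make this output format concrete, the following sketch converts each four-point box into an axis-aligned bounding rectangle with NumPy. It is not part of PaddleOCR: the helper name quad_to_bbox is an assumption made for this example, and the coordinates are the four boxes reported for ./12.jpg.

```python
import numpy as np

# The four boxes reported for ./12.jpg; each box is four (x, y) corner points,
# listed clockwise starting from the upper left.
result = [
    [[79.0, 555.0], [398.0, 542.0], [399.0, 571.0], [80.0, 584.0]],
    [[21.0, 507.0], [512.0, 491.0], [513.0, 532.0], [22.0, 548.0]],
    [[174.0, 458.0], [397.0, 449.0], [398.0, 480.0], [175.0, 489.0]],
    [[42.0, 414.0], [482.0, 392.0], [484.0, 428.0], [44.0, 450.0]],
]


def quad_to_bbox(box):
    """Hypothetical helper: return (x_min, y_min, x_max, y_max) of a 4-point box."""
    pts = np.asarray(box, dtype=np.float32)
    x_min, y_min = pts.min(axis=0)
    x_max, y_max = pts.max(axis=0)
    return float(x_min), float(y_min), float(x_max), float(y_max)


bboxes = [quad_to_bbox(box) for box in result]
for bbox in bboxes:
    print(bbox)
```

The quadrilateral form keeps the slight rotation of each text line, while the rectangle is only a convenient approximation, for example for cropping.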


The paddleocr command below calls the text detection model to run inference on the image ./12.jpg:
import os
# Change the default directory where AI Studio code runs to /home/aistudio/
os.chdir("/home/aistudio/")


# `--image_dir` points to the image path to be predicted; `--rec false` means that no
# recognition is used and only text detection is performed
!paddleocr --image_dir ./12.jpg --rec false


[2021/12/22 21:07:22] root INFO: **********./12.jpg**********
[2021/12/22 21:07:23] root INFO: [[79.0, 555.0], [398.0, 542.0], [399.0, 571.0], [80.0, 584.0]]
[2021/12/22 21:07:23] root INFO: [[21.0, 507.0], [512.0, 491.0], [513.0, 532.0], [22.0, 548.0]]
[2021/12/22 21:07:23] root INFO: [[174.0, 458.0], [397.0, 449.0], [398.0, 480.0], [175.0, 489.0]]
[2021/12/22 21:07:23] root INFO: [[42.0, 414.0], [482.0, 392.0], [484.0, 428.0], [44.0, 450.0]]


In addition, paddleocr also provides a code calling method: 


# 1. Import PaddleOCR class from paddleocr
from paddleocr import PaddleOCR


# 2. Declare the PaddleOCR class
ocr = PaddleOCR()
img_path = "./12.jpg"

# 3. Perform inference (rec=False: detection only, no recognition)
result = ocr.ocr(img_path, rec=False)
print(f"The predicted text box of {img_path} are follows.")
print(result)


The predicted text box of ./12.jpg are follows.
[[[79.0, 555.0], [398.0, 542.0], [399.0, 571.0], [80.0, 584.0]], [[21.0, 507.0], [512.0, 491.0],
[513.0, 532.0], [22.0, 548.0]], [[174.0, 458.0], [397.0, 449.0], [398.0, 480.0], [175.0, 489.0]],
[[42.0, 414.0], [482.0, 392.0], [484.0, 428.0], [44.0, 450.0]]]


Visualize results of text detection inference 


import numpy as np
import cv2
import matplotlib.pyplot as plt
# When using matplotlib.pyplot to draw in the notebook, you need to add this command to display
%matplotlib inline


# 4. Visualize the detection results
image = cv2.imread(img_path)
boxes = [line[0] for line in result]
for box in result:
    # Reshape the four corner points to the (N, 1, 2) layout expected by cv2.polylines
    box = np.reshape(np.array(box), [-1, 1, 2]).astype(np.int64)
    image = cv2.polylines(np.array(image), [box], True, (255, 0, 0), 2)


# Draw the read picture
plt.figure(figsize=(10, 10))
plt.imshow(image)


4.2.2 Detailed Implementation of DB Text Detection Algorithm


The Principle of DB Algorithm


DB is a segmentation-based text detection algorithm that uses a Differentiable Binarization module
(DB module) to distinguish the text region from the background with a dynamic threshold.


[Figure: image → segmentation map → binarization map → detection results, with an additional threshold map branch.
Traditional pipeline (blue flow) and DB pipeline (red flow); dashed arrows are inference-only operators, solid arrows
are operators that are differentiable in both training and inference.]


Figure 2: The difference between DB model and others


The blue arrows in the picture show the process of common segmentation-based text detection algorithms. These
methods apply a fixed threshold to the segmentation output to obtain a binarized segmentation map, and then use heuristic
algorithms such as pixel clustering to get the text regions.


The red arrows in the picture show the flow of the DB algorithm. The biggest difference from common solutions is that
DB has a threshold map: the network predicts a threshold at every pixel of the picture
instead of using a single fixed value, so it can better separate the text foreground from the background.


The DB algorithm has two main advantages:
1. Its algorithmic structure is simple and free of tedious post-processing
2. It achieves good accuracy and speed on open-source datasets


After obtaining the probability map, a traditional algorithm based on image segmentation sets every pixel whose
probability is below a fixed threshold t to 0 and every other pixel to 1. The formula is:


$$B_{i,j} = \begin{cases} 1, & \text{if } P_{i,j} \ge t, \\ 0, & \text{otherwise.} \end{cases}$$


But the standard binarization method is not differentiable, so the network cannot be trained end-to-end with it.
To solve this problem, the DB algorithm uses Differentiable Binarization (DB), which approximates the step function of
standard binarization. It uses another formula:


$$\hat{B}_{i,j} = \frac{1}{1 + e^{-k (P_{i,j} - T_{i,j})}}$$


P refers to the probability map above, T refers to the threshold map, and k is the gain factor, which is set
empirically to 50 in the experiment. Figure 3(a) below shows the differences between standard binarization and
differentiable binarization.


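Let $x = P_{i,j} - T_{i,j}$ and $f(x) = \frac{1}{1 + e^{-kx}}$. When using cross-entropy loss, the losses of positive
and negative samples are $l_+$ and $l_-$ respectively:


$$l_+ = -\log\left(\frac{1}{1 + e^{-kx}}\right), \qquad l_- = -\log\left(1 - \frac{1}{1 + e^{-kx}}\right)$$


Taking the partial derivatives with respect to x gives:


$$\frac{\partial l_+}{\partial x} = -k f(x) e^{-kx}, \qquad \frac{\partial l_-}{\partial x} = k f(x)$$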


It can be seen that the gain factor k magnifies the gradient of wrong predictions, which helps the optimization reach
better results. In Figure 3(b), the part where x < 0 represents the case where positive samples are predicted to be
negative samples, and the gain factor k magnifies the gradient there. Figure 3(c) shows that for x > 0, which is
the case where negative samples are predicted to be positive samples, the gradient is also magnified.
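

The behaviour described above can be checked numerically. The following is a minimal NumPy sketch, not PaddleOCR code: the function names are chosen for this example, and the sample values of P and T are made up. It compares standard and differentiable binarization and evaluates the gradient of the positive-sample loss for two gain factors.

```python
import numpy as np


def standard_binarization(p, t):
    """Hard step: 1 where p >= t, else 0 (not differentiable)."""
    return (p >= t).astype(np.float32)


def differentiable_binarization(p, t, k=50.0):
    """Smooth approximation of the step function used by DB."""
    return 1.0 / (1.0 + np.exp(-k * (p - t)))


def grad_positive(x, k=50.0):
    """Derivative of l_+ = -log f(x) with f(x) = 1 / (1 + exp(-k x))."""
    f = 1.0 / (1.0 + np.exp(-k * x))
    return -k * f * np.exp(-k * x)


# Made-up probabilities and a constant threshold, for illustration only.
P = np.array([0.20, 0.45, 0.55, 0.80])
T = np.full_like(P, 0.5)

hard = standard_binarization(P, T)
soft = differentiable_binarization(P, T)
print("standard      :", hard)
print("differentiable:", np.round(soft, 4))

# A positive sample wrongly scored below the threshold (x < 0):
# a larger k gives a larger gradient magnitude.
print(grad_positive(-0.1, k=1.0), grad_positive(-0.1, k=50.0))
```

With k = 50 the smooth function is already close to the hard step away from the threshold, while remaining differentiable everywhere.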


Figure 3: Schematic of DB algorithm, panels (a), (b) and (c)


The overall structure of the DB algorithm is:
Figure 4: Schematic of DB model network structure


Features of the input image are extracted by the backbone and the FPN and concatenated to get
a feature map whose height and width are a quarter of the original image. Convolutional layers then predict the
probability map and the threshold map, and the DB post-processing produces the outlining curve of each text region.


Building the DB Text Detection Model 


The DB text detection model consists of three parts:
• Backbone network for extraction of image features
• FPN (Feature Pyramid Network) for feature enhancement
• Head network for calculation of the probability map of the text region

This section uses PaddlePaddle to implement the three network modules and build a complete network.


Backbone Network 


The backbone of the DB network is an image classification network. ResNet50 is used in the paper. In this section, to
speed up the training, the experiment uses MobileNetV3 large as the backbone.


# For the first run, uncomment the next line to download the PaddleOCR code
#!git clone https://gitee.com/paddlepaddle/PaddleOCR
import os
# Modify the default directory where the code runs to /home/aistudio/PaddleOCR
os.chdir("/home/aistudio/PaddleOCR")
# Install PaddleOCR third-party dependencies
!pip install --upgrade pip
!pip install -r requirements.txt


# https://github.com/PaddlePaddle/PaddleOCR/blob/release%2F2.4/ppocr/modeling/backbones/det_mobilenet_v3.py
from ppocr.modeling.backbones.det_mobilenet_v3 import MobileNetV3


If you want to use ResNet as the backbone, you can select ResNet in the PaddleOCR code, or select your own backbone
model in PaddleClas.


DB's backbone is used to extract multi-scale features of the image, as shown in the following code. Assume the input
shape is [640, 640]; the backbone network then outputs four features whose shapes are [1, 16, 160, 160], [1, 24,
80, 80], [1, 56, 40, 40], and [1, 480, 20, 20]. These features are passed to the FPN network for further enhancement.


import paddle

fake_inputs = paddle.randn([1, 3, 640, 640], dtype="float32")

# 1. Declare Backbone
model_backbone = MobileNetV3()
model_backbone.eval()

# 2. Perform inference
outs = model_backbone(fake_inputs)

# 3. Print network structure
print(model_backbone)

# 4. Print out shapes of the features
for idx, out in enumerate(outs):
    print("The index is ", idx, "and the shape of output is ", out.shape)


FPN


Feature Pyramid Network, or FPN, is commonly used in convolutional networks to efficiently extract features at each
scale of an image.


# https://github.com/PaddlePaddle/PaddleOCR/blob/release%2F2.4/ppocr/modeling/necks/db_fpn.py

import paddle
from paddle import nn
import paddle.nn.functional as F
from paddle import ParamAttr


class DBFPN(nn.Layer):
    def __init__(self, in_channels, out_channels, **kwargs):
        super(DBFPN, self).__init__()
        self.out_channels = out_channels

        # For detailed implementation of DBFPN (the in*_conv and p*_conv layers), refer to:
        # https://github.com/PaddlePaddle/PaddleOCR/blob/release%2F2.4/ppocr/modeling/necks/db_fpn.py

    def forward(self, x):
        # Four backbone features, from the largest to the smallest resolution
        c2, c3, c4, c5 = x

        in5 = self.in5_conv(c5)
        in4 = self.in4_conv(c4)
        in3 = self.in3_conv(c3)
        in2 = self.in2_conv(c2)

        # Feature upsampling: merge each level with the upsampled coarser level
        out4 = in4 + F.upsample(
            in5, scale_factor=2, mode="nearest", align_mode=1)  # 1/16
        out3 = in3 + F.upsample(
            out4, scale_factor=2, mode="nearest", align_mode=1)  # 1/8
        out2 = in2 + F.upsample(
            out3, scale_factor=2, mode="nearest", align_mode=1)  # 1/4

        p5 = self.p5_conv(in5)
        p4 = self.p4_conv(out4)
        p3 = self.p3_conv(out3)
        p2 = self.p2_conv(out2)

        # Feature upsampling: bring every level to 1/4 of the input resolution
        p5 = F.upsample(p5, scale_factor=8, mode="nearest", align_mode=1)
        p4 = F.upsample(p4, scale_factor=4, mode="nearest", align_mode=1)
        p3 = F.upsample(p3, scale_factor=2, mode="nearest", align_mode=1)

        # Concatenate all levels along the channel axis
        fuse = paddle.concat([p5, p4, p3, p2], axis=1)
        return fuse


The input of the FPN is the output of the backbone, and the height and width of the output feature image are a quarter of 
the original image in size. Assuming that the shape of the input image is [1, 3, 640, 640], the height and width of output 
feature in the FPN are [160, 160]. 
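

The shape bookkeeping of the FPN can be reproduced without PaddlePaddle. The following NumPy sketch is only an illustration under stated assumptions: the convolution layers are replaced by constant stand-in arrays and channel slicing, the channel count C = 8 is arbitrary, and the helper upsample_nearest is defined for this example. It shows how the top-down additions and the final concatenation lead to a fused feature at a quarter of the input resolution.

```python
import numpy as np


def upsample_nearest(x, factor):
    """Nearest-neighbour upsampling of an NCHW array by an integer factor."""
    return x.repeat(factor, axis=2).repeat(factor, axis=3)


C = 8  # illustrative channel count, chosen small for this example

# Stand-ins for the lateral features at 1/4, 1/8, 1/16 and 1/32 of a 640x640 input.
in2 = np.ones((1, C, 160, 160), dtype=np.float32)
in3 = np.ones((1, C, 80, 80), dtype=np.float32)
in4 = np.ones((1, C, 40, 40), dtype=np.float32)
in5 = np.ones((1, C, 20, 20), dtype=np.float32)

# Top-down pathway: add the upsampled coarser level to the finer one.
out4 = in4 + upsample_nearest(in5, 2)   # 1/16
out3 = in3 + upsample_nearest(out4, 2)  # 1/8
out2 = in2 + upsample_nearest(out3, 2)  # 1/4

# Stand-in for the p*_conv layers: assumed here to keep C // 4 channels per level.
q = C // 4
p5 = upsample_nearest(in5[:, :q], 8)
p4 = upsample_nearest(out4[:, :q], 4)
p3 = upsample_nearest(out3[:, :q], 2)
p2 = out2[:, :q]

# All levels now share the 160x160 resolution and can be concatenated.
fuse = np.concatenate([p5, p4, p3, p2], axis=1)
print(fuse.shape)
```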


import paddle 


# 1. Import DBFPN from PaddleOCR 
from ppocr.modeling.necks.db_fpn import DBFPN 


# 2. Obtain Backbone network output results 

fake_inputs = paddle.randn([1, 3, 640, 640], dtype="float32") 
model_backbone = MobileNetV3 () 

in_channles = model_backbone.out_channels 


# 3. Declare FPN network 
model_fpn = DBFPN(in_channels=in_channles, out_channels=256) 


# 4. Print FPN network 
print (model_fpn) 


# 5. Calculate FPN result output 
outs = model_backbone (fake_inputs) 
fpn_outs = model_fpn (outs) 


# 6. Print the shape of the FPN output feature 
print (f"The shape of fpn outs {fpn_outs.shape}") 


Head Network 


The head network computes the text region probability map, the text region threshold map and the binary map.


import math
import paddle
from paddle import nn
import paddle.nn.functional as F
from paddle import ParamAttr


class DBHead(nn.Layer):
    """
    Differentiable Binarization (DB) for text detection:
        see https://arxiv.org/abs/1911.08947
    args:
        params (dict): super parameters for build DB network
    """

    def __init__(self, in_channels, k=50, **kwargs):
        super(DBHead, self).__init__()
        self.k = k

        # For detailed implementation of DBHead (the binarize and thresh branches), refer to:
        # https://github.com/PaddlePaddle/PaddleOCR/blob/release%2F2.4/ppocr/modeling/heads/det_db_head.py

    def step_function(self, x, y):
        # Differentiable binarization: the text segmentation binarization map
        # is calculated from the probability map x and the threshold map y
        return paddle.reciprocal(1 + paddle.exp(-self.k * (x - y)))

    def forward(self, x, targets=None):
        shrink_maps = self.binarize(x)
        # At inference time only the probability map is needed
        if not self.training:
            return {'maps': shrink_maps}

        threshold_maps = self.thresh(x)
        binary_maps = self.step_function(shrink_maps, threshold_maps)
        y = paddle.concat([shrink_maps, threshold_maps, binary_maps], axis=1)
        return {'maps': y}


The DB Head network will perform up-sampling on the basis of the FPN feature, and map the size of the feature from a 
quarter to the same size as the original image. 


# 1. Import DBHead from PaddleOCR
from ppocr.modeling.heads.det_db_head import DBHead 
import paddle 


# 2. Calculate DBFPN network output results 

fake_inputs = paddle.randn([1, 3, 640, 640], dtype="float32") 
model_backbone = MobileNetV3 () 

in_channles = model_backbone.out_channels 

model_fpn = DBFPN(in_channels=in_channles, out_channels=256) 
outs = model_backbone (fake_inputs) 

fpn_outs = model_fpn (outs) 


# 3. Declare Head network 
model_db_head = DBHead(in_channels=256) 


# 4. Print DBhead network 
print (model_db_head) 


# 5. Calculate the output of the Head network 

db_head_outs = model_db_head(fpn_outs) 

print(f"The shape of fpn outs {fpn_outs.shape}")
print(f"The shape of DB head outs {db_head_outs['maps'].shape}")


4.2.3 Training the DB Text Detection Model


The DB text detection algorithm is available in PaddleOCR with two backbone networks, MobileNetV3 and
ResNet50_vd. You can select the configuration file and start training according to your needs.


This section takes the DB detection model with a MobileNetV3 backbone (the configuration used in the
ultra-lightweight model), trained on the icdar15 dataset, as an example to introduce how to train, evaluate and test
text detection models of PaddleOCR.


Data Preparation 


This experiment selects ICDAR2015, the most well-known and commonly-used dataset in Scene Text Detection and 
Recognition. The schematic of the ICDAR2015 dataset is shown below: 


Figure 5: Schematic of the ICDAR2015 dataset


The ICDAR2015 dataset has been downloaded in this project and stored in /home/aistudio/data/data96799. You can run
the following command to decompress the dataset, or download it from the link.


!cd ~/data/data96799/ && tar xf icdar2015.tar


After running the command, there are two folders and two files in ~/train_data/icdar2015/text_localization:


~/train_data/icdar2015/text_localization
  ├── icdar_c4_train_imgs/          Training data of icdar dataset
  ├── ch4_test_images/              Test data of icdar dataset
  ├── train_icdar2015_label.txt     Training label of icdar dataset
  └── test_icdar2015_label.txt      Test label of icdar dataset


The format of the annotation file provided is:
" Image file name             json.dumps encoded image label information"
ch4_test_images/img_61.jpg    [{"transcription": "MASA", "points": [[310, 104], [416, 141], [418, 216], [312, 179]]}, {...}]


Before being encoded with json.dumps, the image label information is a list containing multiple dictionaries. The
points field in each dictionary holds the coordinates (x, y) of the four corner points of the text box, arranged
clockwise from the upper-left point. The transcription field holds the text in the current text box; this information
is not needed for text detection. If you want to train PaddleOCR on other datasets, you can construct the annotation
file in the above format.


If the text in "transcription" is "*" or "###", the corresponding label is ignored. If there is
no text label, the transcription field can be set to an empty string.
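

As a minimal sketch of how one line of such an annotation file could be read, the following example splits the line at the tab character and decodes the JSON part with the standard library. The helper name parse_label_line and the second annotation with the "###" transcription are hypothetical and only serve the illustration.

```python
import json

# One annotation line: image name, a tab, then the json.dumps-encoded label list.
# The second entry ("###") is invented here to show an ignored box.
line = (
    'ch4_test_images/img_61.jpg\t'
    '[{"transcription": "MASA", "points": [[310, 104], [416, 141], [418, 216], [312, 179]]}, '
    '{"transcription": "###", "points": [[0, 0], [10, 0], [10, 10], [0, 10]]}]'
)


def parse_label_line(line):
    """Hypothetical helper: return image name, boxes and ignore flags of one line."""
    img_name, label_json = line.strip().split("\t")
    annotations = json.loads(label_json)
    boxes = [ann["points"] for ann in annotations]
    # Boxes transcribed as "*" or "###" are ignored during training.
    ignore_tags = [ann["transcription"] in ("*", "###") for ann in annotations]
    return img_name, boxes, ignore_tags


img_name, boxes, ignore_tags = parse_label_line(line)
print(img_name, len(boxes), ignore_tags)
```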


Data Preprocessing 


During training, there are certain requirements for the format and size of the input image. At the same time, it is also 
necessary to obtain true labels of the threshold map and the probability map according to the label information. Therefore, 
before inputting the data to the model, the data needs to be preprocessed to make pictures and labels meet the needs of 
network training and inference. In addition, in order to expand the training dataset, suppress over-fitting, and improve the 
generalization ability of the model, we also use several basic data augmentation methods here. 


The preprocessing steps in this experiment include:

• Image decoding: converts the image to NumPy format;
• Label encoding: parses the label information in txt files and saves it in a unified format;
• Basic data augmentation: random horizontal flip, random rotation, random zoom, random crop, etc.;
• Acquisition of threshold map labels: obtains the threshold map labels required in algorithm training through
  expansion;
• Acquisition of probability map labels: obtains the probability map labels required in algorithm training through
  shrinking;
• Normalization: rescales the pixel values, subtracts a per-channel mean and divides by a per-channel standard
  deviation, so that the network inputs have approximately zero mean and unit variance. This makes the optimization
  smoother and the training process easier to converge;
• Channel transformation: the image data format is [H, W, C] (height, width, and channel number), and the training
  data format used by the neural network is [C, H, W], so the image data needs to be rearranged, such as changing
  [224, 224, 3] to [3, 224, 224].


Image Decoding 


import sys
import six
import cv2
import numpy as np


# https://github.com/PaddlePaddle/PaddleOCR/blob/release%2F2.4/ppocr/data/imaug/operators.py
class DecodeImage(object):
    """ decode image """

    def __init__(self, img_mode='RGB', channel_first=False, **kwargs):
        self.img_mode = img_mode
        self.channel_first = channel_first

    def __call__(self, data):
        img = data['image']
        if six.PY2:
            assert type(img) is str and len(
                img) > 0, "invalid input 'img' in DecodeImage"
        else:
            assert type(img) is bytes and len(
                img) > 0, "invalid input 'img' in DecodeImage"
        # 1. Image decoding
        img = np.frombuffer(img, dtype='uint8')
        img = cv2.imdecode(img, 1)

        if img is None:
            return None

        if self.img_mode == 'GRAY':
            img = cv2.cvtColor(img, cv2.COLOR_GRAY2BGR)
        elif self.img_mode == 'RGB':
            assert img.shape[2] == 3, 'invalid shape of image[%s]' % (img.shape)
            # OpenCV decodes to BGR; reverse the channel axis to get RGB
            img = img[:, :, ::-1]

        if self.channel_first:
            img = img.transpose((2, 0, 1))
        # 2. The decoded image is placed in the dictionary
        data['image'] = img
        return data


Next, read the image from labels of the training data to demonstrate the use of the DecodeImage class. 


import json
import cv2
import os
import numpy as np
import matplotlib.pyplot as plt
# When using matplotlib.pyplot to draw in the notebook, you need to add this command for displaying
%matplotlib inline
from PIL import Image


label_path = "/home/aistudio/data/data96799/icdar2015/text_localization/train_icdar2015_label.txt"
img_dir = "/home/aistudio/data/data96799/icdar2015/text_localization/"


# 1. Read the training labels
f = open(label_path, "r")
lines = f.readlines()


# 2. Fetch the first data
line = lines[0]


print("The first data in train_icdar2015_label.txt is as follows.\n", line)
img_name, gt_label = line.strip().split("\t")


# 3. Read the image as raw bytes
image = open(os.path.join(img_dir, img_name), 'rb').read()
data = {'image': image, 'label': gt_label}


Declare the DecodeImage class, decode the image, and return a new dictionary data. 


# 4. Declare the DecodeImage class to decode the image 
decode_image = DecodeImage (img_mode='RGB', channel_first=False) 
data = decode_image (data) 


# 5. Print the shape of the decoded image and visualize the image 
print ("The shape of decoded image is ", data['image'] .shape) 


plt.figure(figsize=(10, 10)) 
plt.imshow(data['image']) 
src_img = data['image'] 


Label Decoding


Parse the label information in txt files and save it in a unified format.


import numpy as np
import string
import json


# For detailed implementation, refer to: https://github.com/PaddlePaddle/PaddleOCR/blob/release%2F2.4/ppocr/data/imaug/label_ops.py#L38
class DetLabelEncode(object):
    def __init__(self, **kwargs):
        pass

    def __call__(self, data):
        label = data['label']
        # 1. Use json to read tags
        label = json.loads(label)
        nBox = len(label)
        boxes, txts, txt_tags = [], [], []
        for bno in range(0, nBox):
            box = label[bno]['points']
            txt = label[bno]['transcription']
            boxes.append(box)
            txts.append(txt)
            # 1.1 If the text label is * or ###, it means this label is invalid
            if txt in ['*', '###']:
                txt_tags.append(True)
            else:
                txt_tags.append(False)
        if len(boxes) == 0:
            return None
        # expand_points_num is defined in the linked implementation
        boxes = self.expand_points_num(boxes)
        boxes = np.array(boxes, dtype=np.float32)
        # np.bool was removed in NumPy 1.24; the builtin bool is equivalent here
        txt_tags = np.array(txt_tags, dtype=bool)

        # 2. Get text, box, and other information
        data['polys'] = boxes
        data['texts'] = txts
        data['ignore_tags'] = txt_tags
        return data


Run the following code to observe the labels before and after decoding by the DetLabelEncode
class.


# Import DetLabelEncode from PaddleOCR 
from ppocr.data.imaug.label_ops import DetLabelEncode 


# 1. Declare the class for label decoding 
decode_label = DetLabelEncode () 


# 2. Print the label before decoding
print("The label before decode are: ", data['label'])


# 3. Label decoding
data = decode_label(data)
print("\n")


# 4. Print the decoded label
print("The polygon after decode are: ", data['polys'])
print("The text after decode are: ", data['texts'])


Basic Data Augmentation 


Data augmentation is commonly used to improve training accuracy and generalization of the model. Common data aug- 
mentation for text detection includes random horizontal flipping, random rotation, random scaling, and random cropping. 


The code example of random horizontal flipping, rotation, and zoom: Code. The code example of random cropping data 
augmentation: Code. 
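

For detection, every geometric augmentation must be applied to the box coordinates as well as to the image. The following NumPy sketch of a horizontal flip illustrates this; it is not the PaddleOCR implementation, and the function name horizontal_flip and the toy data are assumptions made for this example.

```python
import numpy as np


def horizontal_flip(image, polys):
    """Flip an HWC image left-right and mirror the x coordinates of its polygons.

    polys has shape (num_boxes, num_points, 2) with (x, y) pixel coordinates.
    """
    w = image.shape[1]
    flipped = image[:, ::-1].copy()
    new_polys = polys.copy()
    # Pixel column x moves to column w - 1 - x.
    new_polys[..., 0] = w - 1 - polys[..., 0]
    return flipped, new_polys


# Toy 4x6 image with a single marked pixel and one box.
image = np.zeros((4, 6, 3), dtype=np.uint8)
image[1, 0] = 255
polys = np.array([[[0, 1], [2, 1], [2, 3], [0, 3]]], dtype=np.float32)

flipped, new_polys = horizontal_flip(image, polys)
print(new_polys[0].tolist())
```

Note that mirroring reverses the winding direction of the points, so a pipeline that relies on the clockwise-from-upper-left convention would need to reorder them afterwards.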


Acquisition of Threshold Map Labels 


Get the threshold map labels required in algorithm training through expansion.


import numpy as np
import cv2

np.seterr(divide='ignore', invalid='ignore')
import pyclipper
from shapely.geometry import Polygon
import sys
import warnings

warnings.simplefilter("ignore")


# Calculate the threshold map label class of the text region
# For detailed implementation code, refer to: https://github.com/PaddlePaddle/PaddleOCR/blob/release%2F2.4/ppocr/data/imaug/make_border_map.py
class MakeBorderMap(object):
    def __init__(self,
                 shrink_ratio=0.4,
                 thresh_min=0.3,
                 thresh_max=0.7,
                 **kwargs):
        self.shrink_ratio = shrink_ratio
        self.thresh_min = thresh_min
        self.thresh_max = thresh_max

    def __call__(self, data):
        img = data['image']
        text_polys = data['polys']
        ignore_tags = data['ignore_tags']

        # 1. Generate an empty template
        canvas = np.zeros(img.shape[:2], dtype=np.float32)
        mask = np.zeros(img.shape[:2], dtype=np.float32)

        for i in range(len(text_polys)):
            if ignore_tags[i]:
                continue
            # 2. The draw_border_map function calculates the threshold map labels
            # based on the decoded box information
            self.draw_border_map(text_polys[i], canvas, mask=mask)
        # Rescale the map from [0, 1] to [thresh_min, thresh_max]
        canvas = canvas * (self.thresh_max - self.thresh_min) + self.thresh_min

        data['threshold_map'] = canvas
        data['threshold_mask'] = mask
        return data

    def draw_border_map(self, polygon, canvas, mask):
        polygon = np.array(polygon)
        assert polygon.ndim == 2
        assert polygon.shape[1] == 2

        polygon_shape = Polygon(polygon)
        if polygon_shape.area <= 0:
            return
        # Polygon offset distance
        distance = polygon_shape.area * (
            1 - np.power(self.shrink_ratio, 2)) / polygon_shape.length
        subject = [tuple(l) for l in polygon]
        padding = pyclipper.PyclipperOffset()
        padding.AddPath(subject, pyclipper.JT_ROUND, pyclipper.ET_CLOSEDPOLYGON)

        # Calculate the mask from the expanded polygon
        padded_polygon = np.array(padding.Execute(distance)[0])
        cv2.fillPoly(mask, [padded_polygon.astype(np.int32)], 1.0)

        xmin = padded_polygon[:, 0].min()
        xmax = padded_polygon[:, 0].max()
        ymin = padded_polygon[:, 1].min()
        ymax = padded_polygon[:, 1].max()
        width = xmax - xmin + 1
        height = ymax - ymin + 1

        # Move the polygon into the local coordinates of its bounding box
        polygon[:, 0] = polygon[:, 0] - xmin
        polygon[:, 1] = polygon[:, 1] - ymin

        xs = np.broadcast_to(
            np.linspace(
                0, width - 1, num=width).reshape(1, width), (height, width))
        ys = np.broadcast_to(
            np.linspace(
                0, height - 1, num=height).reshape(height, 1), (height, width))

        distance_map = np.zeros(
            (polygon.shape[0], height, width), dtype=np.float32)
        for i in range(polygon.shape[0]):
            j = (i + 1) % polygon.shape[0]
            # Calculate the distance from point to line
            # (_distance is defined in the linked implementation)
            absolute_distance = self._distance(xs, ys, polygon[i], polygon[j])
            distance_map[i] = np.clip(absolute_distance / distance, 0, 1)
        distance_map = distance_map.min(axis=0)

        # Clip the bounding box to the canvas before writing
        xmin_valid = min(max(0, xmin), canvas.shape[1] - 1)
        xmax_valid = min(max(0, xmax), canvas.shape[1] - 1)
        ymin_valid = min(max(0, ymin), canvas.shape[0] - 1)
        ymax_valid = min(max(0, ymax), canvas.shape[0] - 1)
        canvas[ymin_valid:ymax_valid + 1, xmin_valid:xmax_valid + 1] = np.fmax(
            1 - distance_map[ymin_valid - ymin:ymax_valid - ymax + height,
                             xmin_valid - xmin:xmax_valid - xmax + width],
            canvas[ymin_valid:ymax_valid + 1, xmin_valid:xmax_valid + 1])


# Import MakeBorderMap from PaddleOCR
from ppocr.data.imaug.make_border_map import MakeBorderMap


# 1. Declare the MakeBorderMap function
generate_text_border = MakeBorderMap()


# 2. Calculate bordermap information based on the decoded input data
data = generate_text_border(data)


# 3. Visualize the threshold map
plt.figure(figsize=(10, 10))
plt.imshow(src_img)

text_border_map = data['threshold_map']
plt.figure(figsize=(10, 10))
plt.imshow(text_border_map)
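

The expansion used for the threshold map and the shrinking used for the probability map both rely on the offset distance D = A (1 - r^2) / L, where A is the polygon area, L its perimeter and r the shrink ratio (0.4 by default in the code shown in this section). As a small worked example in plain Python, assuming an axis-aligned 100 x 20 text box (the function name offset_distance is chosen for this illustration):

```python
def offset_distance(area, perimeter, shrink_ratio=0.4):
    """Offset distance D = A * (1 - r^2) / L used to expand or shrink a text polygon."""
    return area * (1 - shrink_ratio ** 2) / perimeter


# Hypothetical axis-aligned text box of 100 x 20 pixels.
width, height = 100.0, 20.0
area = width * height             # 2000
perimeter = 2 * (width + height)  # 240

d = offset_distance(area, perimeter)
print(round(d, 3))  # offset in pixels
```

For this box the offset is about 7 pixels, so the border region of the threshold map extends roughly 7 pixels on each side of the annotated boundary.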


Acquisition of Probability Map Labels


Get the probability map labels needed in algorithm training through shrinking.


import numpy as np
import cv2
from shapely.geometry import Polygon
import pyclipper


# Calculate the probability map labels
# For the code example, refer to: https://github.com/PaddlePaddle/PaddleOCR/blob/release%2F2.4/ppocr/data/imaug/make_shrink_map.py
class MakeShrinkMap(object):
    r'''
    Making binary mask from detection data with ICDAR format.
    Typically following the process of class `MakeICDARData`.
    '''

    def __init__(self, min_text_size=8, shrink_ratio=0.4, **kwargs):
        self.min_text_size = min_text_size
        self.shrink_ratio = shrink_ratio

    def __call__(self, data):
        image = data['image']
        text_polys = data['polys']
        ignore_tags = data['ignore_tags']

        h, w = image.shape[:2]
        # 1. Verify text detection labels
        # (validate_polygons is defined in the linked implementation)
        text_polys, ignore_tags = self.validate_polygons(text_polys,
                                                         ignore_tags, h, w)
        gt = np.zeros((h, w), dtype=np.float32)
        mask = np.ones((h, w), dtype=np.float32)

        # 2. Calculate the probability map of the text region according to the text
        # detection box
        for i in range(len(text_polys)):
            polygon = text_polys[i]
            height = max(polygon[:, 1]) - min(polygon[:, 1])
            width = max(polygon[:, 0]) - min(polygon[:, 0])
            if ignore_tags[i] or min(height, width) < self.min_text_size:
                # Ignored or too-small boxes are masked out of the loss
                cv2.fillPoly(mask,
                             polygon.astype(np.int32)[np.newaxis, :, :], 0)
                ignore_tags[i] = True
            else:
                # Polygon shrinking
                polygon_shape = Polygon(polygon)
                subject = [tuple(l) for l in polygon]
                padding = pyclipper.PyclipperOffset()
                padding.AddPath(subject, pyclipper.JT_ROUND,
                                pyclipper.ET_CLOSEDPOLYGON)
                shrinked = []

                # Increase the shrink ratio every time we get multiple polygons
                # returned back
                possible_ratios = np.arange(self.shrink_ratio, 1,
                                            self.shrink_ratio)
                # Note: the return value is discarded, so this line has no effect
                np.append(possible_ratios, 1)
                # print(possible_ratios)
                for ratio in possible_ratios:
                    # print(f"Change shrink ratio to {ratio}")
                    distance = polygon_shape.area * (
                        1 - np.power(ratio, 2)) / polygon_shape.length
                    shrinked = padding.Execute(-distance)
                    if len(shrinked) == 1:
                        break

                if shrinked == []:
                    cv2.fillPoly(mask,
                                 polygon.astype(np.int32)[np.newaxis, :, :], 0)
                    ignore_tags[i] = True
                    continue

                # Fill the shrunk polygon(s) into the probability map
                for each_shrink in shrinked:
                    shrink = np.array(each_shrink).reshape(-1, 2)
                    cv2.fillPoly(gt, [shrink.astype(np.int32)], 1)

        data['shrink_map'] = gt
        data['shrink_mask'] = mask
        return data


# Import MakeShrinkMap from PaddleOCR
from ppocr.data.imaug.make_shrink_map import MakeShrinkMap


# 1. Declare the generation of text probability map labels
generate_shrink_map = MakeShrinkMap()


# 2. Calculate the probability map of the text region based on the decoded labels
data = generate_shrink_map(data)


# 3. Visualize the probability map of the text region
plt.figure(figsize=(10, 10))
plt.imshow(src_img)

text_border_map = data['shrink_map']
plt.figure(figsize=(10, 10))
plt.imshow(text_border_map)


Normalization


Normalization rescales the pixel values, subtracts a per-channel mean and divides by a per-channel standard deviation,
so that the network inputs have approximately zero mean and unit variance. This makes the optimization
smoother and the training process easier to converge.


# Image normalization
class NormalizeImage(object):
    """ normalize image such as subtract mean, divide std
    """

    def __init__(self, scale=None, mean=None, std=None, order='chw', **kwargs):
        if isinstance(scale, str):
            scale = eval(scale)
        self.scale = np.float32(scale if scale is not None else 1.0 / 255.0)
        # 1. Get the normalization mean and standard deviation
        mean = mean if mean is not None else [0.485, 0.456, 0.406]
        std = std if std is not None else [0.229, 0.224, 0.225]

        shape = (3, 1, 1) if order == 'chw' else (1, 1, 3)
        self.mean = np.array(mean).reshape(shape).astype('float32')
        self.std = np.array(std).reshape(shape).astype('float32')

    def __call__(self, data):
        # 2. Get image data from dictionaries
        img = data['image']
        from PIL import Image
        if isinstance(img, Image.Image):
            img = np.array(img)
        assert isinstance(img, np.ndarray), "invalid input 'img' in NormalizeImage"

        # 3. Image normalization
        data['image'] = (img.astype('float32') * self.scale - self.mean) / self.std
        return data
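

The arithmetic performed by the normalization step can be checked on a tiny image. The following NumPy sketch assumes an [H, W, C] image and the default mean and standard deviation shown above; the function name normalize_hwc and the 2 x 2 test image are made up for this illustration.

```python
import numpy as np

# Default per-channel statistics from the class above, shaped for an HWC image.
mean = np.array([0.485, 0.456, 0.406], dtype=np.float32).reshape(1, 1, 3)
std = np.array([0.229, 0.224, 0.225], dtype=np.float32).reshape(1, 1, 3)
scale = np.float32(1.0 / 255.0)


def normalize_hwc(img):
    """Scale uint8 pixels to [0, 1], then subtract the mean and divide by the std."""
    return (img.astype("float32") * scale - mean) / std


# 2x2 black image with one white pixel.
img = np.zeros((2, 2, 3), dtype=np.uint8)
img[0, 0] = 255

out = normalize_hwc(img)
print(out[0, 0], out[1, 1])
```

A white pixel maps to (1 - mean) / std and a black pixel to -mean / std in each channel, which is what the assertions in a unit test for this step would typically check.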


Channel Transformation 


The image data format is [H, W, C] ( height, width, and channel number), and the training data format used by the neural 
network is [C, H, W], so the image data needs to be rearranged, such as changing [224, 224, 3] to [3, 224, 224]. 


# Change the channel order of the image, HWC to CHW
class ToCHWImage(object):
    """ convert hwc image to chw image
    """

    def __init__(self, **kwargs):
        pass

    def __call__(self, data):
        # 1. Get image data from dictionary
        img = data['image']
        from PIL import Image
        if isinstance(img, Image.Image):
            img = np.array(img)

        # 2. Change the channel order of the image by transposing
        data['image'] = img.transpose((2, 0, 1))
        return data


# 1. Declare the channel transformation class 
transpose = ToCHWImage () 


# 2. Print the image before transformation 
print ("The shape of image before transpose", data['image'].shape) 


# 3. Image channel transformation 
data = transpose (data) 


# 4. Print the transformed image to the channel 
print ("The shape of image after transpose", data['image'].shape) 


Building a Dataloader 


The above code only shows how to read one picture and preprocess it. In actual model training, however, a batch
of data is usually read and processed at once.


This section uses the Dataset and DataLoader APIs to build a dataloader.


# For detailed code for building dataloader, refer to:
# https://github.com/PaddlePaddle/PaddleOCR/blob/release%2F2.4/ppocr/data/simple_dataset.py

import numpy as np
import os
import random
from paddle.io import Dataset


def transform(data, ops=None):
    """ transform """
    if ops is None:
        ops = []
    for op in ops:
        data = op(data)
        if data is None:
            return None
    return data


def create_operators(op_param_list, global_config=None):
    """
    create operators based on the config
    Args:
        params(list): a dict list, used to create some operators
    """
    assert isinstance(op_param_list, list), ('operator config should be a list')
    ops = []
    for operator in op_param_list:
        assert isinstance(operator,
                          dict) and len(operator) == 1, "yaml format error"
        op_name = list(operator)[0]
        param = {} if operator[op_name] is None else operator[op_name]
        if global_config is not None:
            param.update(global_config)
        op = eval(op_name)(**param)
        ops.append(op)
    return ops


class SimpleDataSet(Dataset):
    def __init__(self, mode, label_file, data_dir, seed=None, ops=None):
        super(SimpleDataSet, self).__init__()
        # In the label file, '\t' separates the image name from the label
        self.delimiter = '\t'
        # Dataset path
        self.data_dir = data_dir
        # Random number seed
        self.seed = seed
        # Optional list of preprocessing operators, e.g. built with create_operators
        # (the `ops` argument is an assumption of this listing; none are applied by default)
        self.ops = ops if ops is not None else []
        # Get all the data and return them in a list
        self.data_lines = self.get_image_info_list(label_file)
        # Create a new list to store data index
        self.data_idx_order_list = list(range(len(self.data_lines)))
        self.mode = mode
        # If it is a training process, randomly shuffle the dataset
        if self.mode.lower() == "train":
            self.shuffle_data_random()

    def get_image_info_list(self, label_file):
        # Get all the data in the label file
        with open(label_file, "rb") as f:
            lines = f.readlines()
        return lines

    def shuffle_data_random(self):
        # Randomly shuffle the data
        random.seed(self.seed)
        random.shuffle(self.data_lines)
        return

    def __getitem__(self, idx):
        # 1. Get the data whose index is idx
        file_idx = self.data_idx_order_list[idx]
        data_line = self.data_lines[file_idx]
        try:
            # 2. Get the image name and label
            data_line = data_line.decode('utf-8')
            substr = data_line.strip("\n").split(self.delimiter)
            file_name = substr[0]
            label = substr[1]
            # 3. Get the image path
            img_path = os.path.join(self.data_dir, file_name)
            data = {'img_path': img_path, 'label': label}
            if not os.path.exists(img_path):
                raise Exception("{} does not exist!".format(img_path))
            # 4. Read the picture and preprocess it
            with open(data['img_path'], 'rb') as f:
                img = f.read()
                data['image'] = img
            # 5. Complete the data augmentation
            outs = transform(data, self.ops)
        # 6. If it fails to read the current data, read another randomly
        except Exception as e:
            outs = None
        if outs is None:
            return self.__getitem__(np.random.randint(self.__len__()))
        return outs

    def __len__(self):
        # Return the size of the dataset
        return len(self.data_idx_order_list)
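The two ideas the dataset class relies on, splitting a label line at the tab character and running a list of operators until one of them returns None, can be tried in isolation. The following is a minimal sketch in pure Python; the function names parse_label_line, apply_ops, add_length, and drop_empty are hypothetical, the toy operators only stand in for real preprocessing classes, and the split is limited to the first tab, which is a small deviation from the listing above.

```python
def parse_label_line(line, delimiter='\t'):
    """Split one raw label line (bytes) into (file_name, label)."""
    text = line.decode('utf-8').strip('\n')
    file_name, label = text.split(delimiter, 1)
    return file_name, label


def apply_ops(data, ops):
    """Run operators in order; stop early if one of them returns None."""
    for op in ops:
        data = op(data)
        if data is None:
            return None
    return data


# Two toy operators standing in for real preprocessing classes
def add_length(data):
    data['label_len'] = len(data['label'])
    return data


def drop_empty(data):
    # Returning None signals that the sample should be discarded
    return data if data['label_len'] > 0 else None


name, label = parse_label_line(b'img_1.jpg\t[{"transcription": "hello"}]\n')
sample = apply_ops({'img_path': name, 'label': label}, [add_length, drop_empty])
empty = apply_ops({'img_path': 'img_2.jpg', 'label': ''}, [add_length, drop_empty])
print(sample['label_len'], empty)
```

A sample for which any operator returns None is dropped; in the dataset class this is the case where another sample is read at random.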


PaddlePaddle's DataLoader API can read data using multiple processes, and the number of worker processes can be set
freely. Reading data with several processes can accelerate data processing and model training. The code is as follows:


from paddle.io import Dataset, DataLoader, BatchSampler, DistributedBatchSampler


def build_dataloader(mode, label_file, data_dir, batch_size, drop_last, shuffle,
                     num_workers, seed=None):
    # Create data reading class
    dataset = SimpleDataSet(mode, label_file, data_dir, seed)
    # Define batch_sampler
    batch_sampler = BatchSampler(dataset=dataset, batch_size=batch_size,
                                 shuffle=shuffle, drop_last=drop_last)
    # Use paddle.io.DataLoader to build a dataloader, and set the batch size, the number
    # of processes num_workers, and other parameters
    data_loader = DataLoader(dataset=dataset, batch_sampler=batch_sampler,
                             num_workers=num_workers, return_list=True,
                             use_shared_memory=False)
    return data_loader


ic15_data_path = "/home/aistudio/data/data96799/icdar2015/text_localization/"
train_data_label = "/home/aistudio/data/data96799/icdar2015/text_localization/train_icdar2015_label.txt"
eval_data_label = "/home/aistudio/data/data96799/icdar2015/text_localization/test_icdar2015_label.txt"

# Define the training set dataloader (batch size 8; num_workers=0 loads data in the main process)
train_dataloader = build_dataloader('Train', train_data_label, ic15_data_path,
                                    batch_size=8, drop_last=False, shuffle=True, num_workers=0)

# Define the validation set dataloader
eval_dataloader = build_dataloader('Eval', eval_data_label, ic15_data_path,
                                   batch_size=1, drop_last=False, shuffle=False, num_workers=0)


DB Model Post-processing 


The output of the DB head network has the same spatial size as the original image. Its three output channels are the
probability map, the threshold map, and the binary map.


During training, the three predicted maps are compared with their ground-truth labels to compute the loss function
and train the model.


During inference, only the probability map is needed. Based on the response of the text regions in the probability map,
the DB post-processing function calculates the coordinates of the boxes surrounding the text.


Since the probability map predicted by the network corresponds to shrunk text regions, the text box can be obtained in
post-processing by expanding the predicted polygon area with the same offset value. The code example is shown below.


# https://github.com/PaddlePaddle/PaddleOCR/blob/release%2F2.4/ppocr/postprocess/db_postprocess.py

import numpy as np
import cv2
import paddle
from shapely.geometry import Polygon
import pyclipper


class DBPostProcess(object):
    """
    The post process for Differentiable Binarization (DB).
    """

    def __init__(self,
                 thresh=0.3,
                 box_thresh=0.7,
                 max_candidates=1000,
                 unclip_ratio=2.0,
                 use_dilation=False,
                 score_mode="fast",
                 **kwargs):
        # 1. Obtain post-processing hyperparameters
        self.thresh = thresh
        self.box_thresh = box_thresh
        self.max_candidates = max_candidates
        self.unclip_ratio = unclip_ratio
        self.min_size = 3
        self.score_mode = score_mode
        assert score_mode in [
            "slow", "fast"
        ], "Score mode must be in [slow, fast] but got: {}".format(score_mode)

        self.dilation_kernel = None if not use_dilation else np.array(
            [[1, 1], [1, 1]])

    # For detailed implementation of DB post-processing code, refer to:
    # https://github.com/PaddlePaddle/PaddleOCR/blob/release%2F2.4/ppocr/postprocess/db_postprocess.py

    def __call__(self, outs_dict, shape_list):
        # 1. Get network inference results from the dictionary
        pred = outs_dict['maps']
        if isinstance(pred, paddle.Tensor):
            pred = pred.numpy()
        pred = pred[:, 0, :, :]

        # 2. Keep the pixels greater than the post-processing threshold self.thresh
        segmentation = pred > self.thresh

        boxes_batch = []
        for batch_index in range(pred.shape[0]):
            # 3. Get the shape of the original image and its resize ratios
            src_h, src_w, ratio_h, ratio_w = shape_list[batch_index]
            if self.dilation_kernel is not None:
                mask = cv2.dilate(
                    np.array(segmentation[batch_index]).astype(np.uint8),
                    self.dilation_kernel)
            else:
                mask = segmentation[batch_index]

            # 4. Use the boxes_from_bitmap function to calculate the text boxes from
            # the predicted text probability map
            boxes, scores = self.boxes_from_bitmap(pred[batch_index], mask,
                                                   src_w, src_h)

            boxes_batch.append({'points': boxes})
        return boxes_batch
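The expansion step mentioned above can be illustrated with a small sketch. In DB, the offset distance used to expand a shrunk region is D = A × r / L, where A is the area of the region, L its perimeter, and r the unclip ratio. The sketch below applies this to an axis-aligned rectangle so that it runs with the standard library only; this restriction and the function names unclip_distance and unclip_rect are assumptions of the sketch, whereas the PaddleOCR implementation offsets arbitrary polygons with pyclipper.

```python
def unclip_distance(width, height, unclip_ratio=1.5):
    """Offset distance D = area * unclip_ratio / perimeter for a rectangle."""
    area = width * height
    perimeter = 2 * (width + height)
    return area * unclip_ratio / perimeter


def unclip_rect(x_min, y_min, x_max, y_max, unclip_ratio=1.5):
    """Grow an axis-aligned box outward by the unclip distance on every side."""
    d = unclip_distance(x_max - x_min, y_max - y_min, unclip_ratio)
    return (x_min - d, y_min - d, x_max + d, y_max + d)


shrunk = (10.0, 10.0, 110.0, 30.0)  # a 100 x 20 shrunk text region
print(unclip_rect(*shrunk, unclip_ratio=1.5))
print(unclip_rect(*shrunk, unclip_ratio=4.0))
```

For the 100 × 20 region, a ratio of 1.5 gives an offset of 12.5 pixels per side, and a larger ratio yields a proportionally larger box.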


You can see that each word is surrounded by a blue box. These blue boxes are obtained by post-processing the
segmentation results output by DB. To visualize the segmentation map output by DB, add the following code at line 177
of PaddleOCR/ppocr/postprocess/db_postprocess.py. The visualization result is saved as the image
vis_segmentation.png.


_maps = np.array(pred[0, :, :] * 255).astype(np.uint8)
import cv2
cv2.imwrite("vis_segmentation.png", _maps)


# 1. Download the trained model
!wget -nc -P ./pretrain_models/ https://paddleocr.bj.bcebos.com/dygraph_v2.0/en/det_mv3_db_v2.0_train.tar
!cd ./pretrain_models/ && tar xf det_mv3_db_v2.0_train.tar && cd ../

# 2. Perform text detection inference to get the result
!python tools/infer_det.py -c configs/det/det_mv3_db.yml \
    -o Global.checkpoints=./pretrain_models/det_mv3_db_v2.0_train/best_accuracy \
    Global.infer_img=./doc/imgs_en/img_12.jpg
    # PostProcess.unclip_ratio=4.0
# Note: For the introduction and usage of PostProcess parameters and Global parameters, refer to:
# https://github.com/PaddlePaddle/PaddleOCR/blob/release%2F2.3/doc/doc_ch/config.md


Visualize the text probability map predicted by the model and the final predicted text boxes.


import matplotlib.pyplot as plt
import numpy as np
from PIL import Image

img = Image.open('./output/det_db/det_results/img_12.jpg')
img = np.array(img)

# Draw the detection result
plt.figure(figsize=(10, 10))
plt.imshow(img)

img = Image.open('./vis_segmentation.png')
img = np.array(img)

# Draw the segmentation map
plt.figure(figsize=(10, 10))
plt.imshow(img)


The visualization shows that the output of DB resembles a binary image: the response of the text regions is high,
while that of the non-text background is low. DB post-processing finds the minimum bounding box of each response
region and thus obtains the coordinates of each text region. In addition, by modifying the post-processing parameters,
you can adjust the size of the text boxes or filter out poorly detected text boxes.


There are four parameters for DB post-processing:

* thresh: The threshold for binarizing the segmentation map in DBPostProcess. Its default value is 0.3.

* box_thresh: The threshold for filtering the output boxes in DBPostProcess. Boxes below this threshold are not
output.

* unclip_ratio: The ratio by which text boxes are enlarged in DBPostProcess.

* max_candidates: The maximum number of text boxes output by DBPostProcess. Its default value is 1000.
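To make the roles of thresh and box_thresh concrete, here is a minimal sketch on a tiny synthetic probability map. It simplifies the real post-processing by using axis-aligned candidate boxes and the mean probability inside each box as its score; the helper names binarize and box_score and the toy data are assumptions of this sketch, not part of PaddleOCR.

```python
import numpy as np


def binarize(prob_map, thresh=0.3):
    """Pixels whose probability exceeds thresh are treated as text."""
    return prob_map > thresh


def box_score(prob_map, x_min, y_min, x_max, y_max):
    """Mean probability inside an axis-aligned box (end coordinates exclusive)."""
    return float(prob_map[y_min:y_max, x_min:x_max].mean())


prob_map = np.zeros((8, 8), dtype='float32')
prob_map[1:3, 1:6] = 0.9   # a confident text region
prob_map[5:7, 1:6] = 0.4   # a weak response

# thresh decides which pixels enter the segmentation mask
mask = binarize(prob_map, thresh=0.3)

# box_thresh decides which candidate boxes are kept
candidates = [(1, 1, 6, 3), (1, 5, 6, 7)]
box_thresh = 0.6
kept = [box for box in candidates if box_score(prob_map, *box) >= box_thresh]
print(int(mask.sum()), kept)
```

Both regions pass the pixel-level threshold, but only the confident one survives the box-level filter, which is why lowering box_thresh recovers weakly detected text.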


# 3. Increase the unclip_ratio parameter of DB post-processing to 4.0 (the default is 1.5) to change
# the size of the output text boxes, then perform text detection inference to get the result
!python tools/infer_det.py -c configs/det/det_mv3_db.yml \
    -o Global.checkpoints=./pretrain_models/det_mv3_db_v2.0_train/best_accuracy \
    Global.infer_img=./doc/imgs_en/img_12.jpg \
    PostProcess.unclip_ratio=4.0
# Note: For the introduction and usage of PostProcess parameters and Global parameters, refer to:
# https://github.com/PaddlePaddle/PaddleOCR/blob/release%2F2.4/doc/doc_ch/config.md


img = Image.open('./output/det_db/det_results/img_12.jpg')
img = np.array(img)

# Draw the read picture
plt.figure(figsize=(10, 10))
plt.imshow(img)

img = Image.open('./vis_segmentation.png')
img = np.array(img)

# Draw the read picture
plt.figure(figsize=(10, 10))
plt.imshow(img)


The results of the code above show that increasing the unclip_ratio parameter of DB post-processing makes the
predicted text boxes much larger. Therefore, when the results of a trained model do not meet expectations, adjusting
the post-processing parameters is a feasible way to refine them. You can also try adjusting the other three
parameters, thresh, box_thresh, and max_candidates, and compare the corresponding detection results.


Loss Function Definition 


Since three predicted maps are obtained during training, the loss function combines these three maps with their
ground-truth labels to build three loss terms. The whole loss function is defined as follows:


$$L = L_b + \alpha \times L_s + \beta \times L_t$$


Here $L$ is the total loss. $L_s$ is the probability map loss; in this experiment, the Dice loss with OHEM (online hard
example mining) is used. $L_t$ is the threshold map loss; the $L_1$ distance between the predicted value and the label
is used in this experiment. $L_b$ is the loss function of the text binary map. $\alpha$ and $\beta$ are the weight
coefficients, which are set to 5 and 10 here.


The three losses $L_b$, $L_s$, and $L_t$ correspond to Dice Loss, Dice Loss (OHEM), and MaskL1 Loss, respectively. These
three parts are defined next:


* Dice Loss compares the similarity between the predicted text binary map and the label. It is often used in
binary image segmentation. For a code example, refer to link. The formula is as follows:


$$dice\_loss = 1 - \frac{2 \times intersection\_area}{total\_area}$$


* Dice Loss (OHEM) uses Dice Loss with OHEM to mitigate the imbalance between positive and negative samples. OHEM
is a sampling method that automatically selects difficult samples for loss calculation, thereby improving the
training effect of the model. Here, the sampling ratio of positive to negative samples is set to 1:3.
For the code example, refer to link.


* MaskL1 Loss calculates the $L_1$ distance between the predicted text threshold map and the label.
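The Dice loss formula can be checked numerically with a few lines of NumPy. This is a minimal sketch, not the PaddleOCR DiceLoss layer: the function name dice_loss, the toy arrays, and the small eps added to the denominator to avoid division by zero are assumptions of the sketch.

```python
import numpy as np


def dice_loss(pred, gt, mask, eps=1e-6):
    """Dice loss between a predicted map and a binary label, restricted to mask."""
    # Overlap between prediction and label inside the valid region
    intersection = np.sum(pred * gt * mask)
    # Sum of predicted and labeled areas; eps avoids division by zero
    total = np.sum(pred * mask) + np.sum(gt * mask) + eps
    return 1.0 - 2.0 * intersection / total


gt = np.array([[1, 1, 0, 0]], dtype='float32')
mask = np.ones_like(gt)
perfect = dice_loss(gt, gt, mask)          # prediction identical to the label
disjoint = dice_loss(1.0 - gt, gt, mask)   # prediction with no overlap at all
print(float(perfect), float(disjoint))
```

A prediction that matches the label gives a loss close to 0, and a prediction with no overlap gives a loss of 1, which is the range the formula implies.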


import paddle
from paddle import nn
import paddle.nn.functional as F
# The basic losses come from PaddleOCR (assumes the repository root is on the Python path)
from ppocr.losses.det_basic_loss import BalanceLoss, MaskL1Loss, DiceLoss


# DB loss function
# For the code example, refer to:
# https://github.com/PaddlePaddle/PaddleOCR/blob/release%2F2.4/ppocr/losses/det_db_loss.py
class DBLoss(nn.Layer):
    """
    Differentiable Binarization (DB) Loss Function
    args:
        param (dict): the super parameter for DB Loss
    """

    def __init__(self,
                 balance_loss=True,
                 main_loss_type='DiceLoss',
                 alpha=5,
                 beta=10,
                 ohem_ratio=3,
                 eps=1e-6,
                 **kwargs):
        super(DBLoss, self).__init__()
        self.alpha = alpha
        self.beta = beta
        # Declare different loss functions
        self.dice_loss = DiceLoss(eps=eps)
        self.l1_loss = MaskL1Loss(eps=eps)
        self.bce_loss = BalanceLoss(
            balance_loss=balance_loss,
            main_loss_type=main_loss_type,
            negative_ratio=ohem_ratio)

    def forward(self, predicts, labels):
        predict_maps = predicts['maps']
        label_threshold_map, label_threshold_mask, label_shrink_map, label_shrink_mask = labels[
            1:]
        shrink_maps = predict_maps[:, 0, :, :]
        threshold_maps = predict_maps[:, 1, :, :]
        binary_maps = predict_maps[:, 2, :, :]

        # 1. For the text probability map, use the balanced (OHEM) loss;
        # its base loss is set by main_loss_type (Dice loss here)
        loss_shrink_maps = self.bce_loss(shrink_maps, label_shrink_map,
                                         label_shrink_mask)
        # 2. For the text threshold map, use the L1 distance loss function
        loss_threshold_maps = self.l1_loss(threshold_maps, label_threshold_map,
                                           label_threshold_mask)
        # 3. For the text binary map, use the Dice loss function
        loss_binary_maps = self.dice_loss(binary_maps, label_shrink_map,
                                          label_shrink_mask)

        # 4. Multiply different loss functions by different weights
        loss_shrink_maps = self.alpha * loss_shrink_maps
        loss_threshold_maps = self.beta * loss_threshold_maps

        loss_all = loss_shrink_maps + loss_threshold_maps \
                   + loss_binary_maps
        losses = {'loss': loss_all, \
                  "loss_shrink_maps": loss_shrink_maps, \
                  "loss_threshold_maps": loss_threshold_maps, \
                  "loss_binary_maps": loss_binary_maps}
        return losses


Index Evaluation 


Since the boxes produced by DB post-processing vary in shape and are not necessarily horizontal, this experiment
adopts a simple IOU-based method for evaluation. For the calculation code, refer to the text detection evaluation
method of ICDAR Challenge 4.


There are three metrics for text detection: Precision, Recall, and Hmean. They are calculated as follows:


1. Create a matrix of size [n, m] called iouMat, where n is the number of GT (ground truth) boxes and m is the number
of detected boxes. Both n and m exclude the boxes whose text is labeled ###.


2. Count the number of IOUs greater than the threshold 0.5 in iouMat, and divide this value by the number of GT boxes
n to get Recall.


3. Count the number of IOUs greater than the threshold 0.5 in iouMat, and divide this value by the number of
detected boxes m to get Precision.


4. Hmean is calculated in the same way as the F1-score, and the formula is as follows:


$$Hmean = 2.0 \times \frac{Precision \times Recall}{Precision + Recall}$$


The core code of the text detection metric calculation is shown below. For the code example, refer to link.


# The calculation method of the text detection metric is as follows:
# Complete code reference:
# https://github.com/PaddlePaddle/PaddleOCR/blob/release%2F2.4/ppocr/metrics/det_metric.py
if len(gtPols) > 0 and len(detPols) > 0:
    outputShape = [len(gtPols), len(detPols)]

    # 1. Create a matrix of size [n, m] to save the calculated IOU
    iouMat = np.empty(outputShape)
    gtRectMat = np.zeros(len(gtPols), np.int8)
    detRectMat = np.zeros(len(detPols), np.int8)
    for gtNum in range(len(gtPols)):
        for detNum in range(len(detPols)):
            pG = gtPols[gtNum]
            pD = detPols[detNum]

            # 2. Calculate the IOU between the inference box and the GT box
            iouMat[gtNum, detNum] = get_intersection_over_union(pD, pG)
    for gtNum in range(len(gtPols)):
        for detNum in range(len(detPols)):
            if gtRectMat[gtNum] == 0 and detRectMat[
                    detNum] == 0 and gtNum not in gtDontCarePolsNum and detNum not in detDontCarePolsNum:

                # 2.1 Count the number of IOUs greater than the threshold 0.5
                if iouMat[gtNum, detNum] > self.iou_constraint:
                    gtRectMat[gtNum] = 1
                    detRectMat[detNum] = 1
                    detMatched += 1
                    pairs.append({'gt': gtNum, 'det': detNum})
                    detMatchedNums.append(detNum)

# 3. Divide the number of IOUs greater than the threshold 0.5 by the number of GT
# boxes numGtCare to get the recall
recall = float(detMatched) / numGtCare

# 4. Divide the number of IOUs greater than the threshold 0.5 by the number of
# inference boxes numDetCare to get precision
precision = 0 if numDetCare == 0 else float(detMatched) / numDetCare

# 5. Calculate the Hmean indicator with the formula
hmean = 0 if (precision + recall) == 0 else 2.0 * \
    precision * recall / (precision + recall)
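The excerpt above depends on variables defined elsewhere in det_metric.py, so it cannot be run on its own. The following self-contained sketch reproduces the same counting logic under simplifying assumptions: boxes are axis-aligned rectangles instead of polygons, there are no "don't care" boxes, and the function names rect_iou and detection_metrics are hypothetical.

```python
def rect_iou(a, b):
    """IoU of two axis-aligned boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def detection_metrics(gt_boxes, det_boxes, iou_threshold=0.5):
    """One-to-one matching at the IoU threshold, then precision/recall/hmean."""
    gt_used = [False] * len(gt_boxes)
    det_used = [False] * len(det_boxes)
    matched = 0
    for g, gt in enumerate(gt_boxes):
        for d, det in enumerate(det_boxes):
            # Each GT box and each detected box can be matched at most once
            if not gt_used[g] and not det_used[d] and rect_iou(gt, det) > iou_threshold:
                gt_used[g] = det_used[d] = True
                matched += 1
    recall = matched / len(gt_boxes) if gt_boxes else 0.0
    precision = matched / len(det_boxes) if det_boxes else 0.0
    total = precision + recall
    hmean = 0.0 if total == 0 else 2.0 * precision * recall / total
    return precision, recall, hmean


gts = [(0, 0, 10, 10), (20, 0, 30, 10)]
dets = [(1, 0, 10, 10), (50, 50, 60, 60), (20, 0, 29, 10)]
print(detection_metrics(gts, dets))
```

In this toy case both GT boxes are matched, but one of the three detections is a false positive, so recall is 1 while precision, and therefore Hmean, is lower.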


Questions: 


1. In the situation shown below, the IOU between the GT box and the inference box is greater than 0.5, yet part of the
text has not been detected. Can the metric calculation above accurately reflect the accuracy of the model?


2. When such problems are encountered in an experiment, how can the model be optimized?


Figure 6: Example of labeling of GT box and inference box


Model Training


Once data processing, the network, and the loss function have been defined, you can start training the model.


Training uses the PaddleOCR training pipeline, which is driven by a parameter configuration file. For parameter files,
refer to link. The network structure parameters are as follows:


Architecture:
  model_type: det
  algorithm: DB
  Transform:
  Backbone:
    name: MobileNetV3
    scale: 0.5
    model_name: large
  Neck:
    name: DBFPN
    out_channels: 256
  Head:
    name: DBHead
    k: 50


Parameters of the optimizer are: 


Optimizer:
  name: Adam
  beta1: 0.9
  beta2: 0.999
  lr:
    learning_rate: 0.001
  regularizer:
    name: 'L2'
    factor: 0


Parameters of the post-processing are as follows: 


PostProcess:
  name: DBPostProcess
  thresh: 0.3
  box_thresh: 0.6
  max_candidates: 1000
  unclip_ratio: 1.5


For the complete parameter configuration file, refer to det_mv3_db.yml. 


!mkdir train_data
!cd train_data && ln -s /home/aistudio/data/data96799/icdar2015 icdar2015
!wget -P ./pretrain_models/ https://paddle-imagenet-models-name.bj.bcebos.com/dygraph/MobileNetV3_large_x0_5_pretrained.pdparams

!python tools/train.py -c configs/det/det_mv3_db.yml


After training, the model is saved in PaddleOCR/output/db_mv3/ by default. To change the directory, set the
parameter Global.save_model_dir during training, for example:

# Set Global.save_model_dir to change the model directory
python tools/train.py -c configs/det/det_mv3_db.yml -o Global.save_model_dir="./output/save_db_train/"


Model Evaluation 


During training, two models are saved by default: the most recently trained model, named latest, and the most
accurate model, named best_accuracy. Next, use the saved model parameters to evaluate precision, recall, and hmean on
the test set.


The text detection accuracy evaluation code is in PaddleOCR/ppocr/metrics/det_metric.py. You can call
tools/eval.py to evaluate the accuracy of the trained model.


!python tools/eval.py -c configs/det/det_mv3_db.yml -o Global.checkpoints=./output/db_mv3/best_accuracy


Model Inference 


After training the model, you can also use the saved model to run inference on a single image or on a folder of
images from the dataset, and observe the inference result.


!python tools/infer_det.py -c configs/det/det_mv3_db.yml -o Global.checkpoints=./pretrain_models/det_mv3_db_v2.0_train/best_accuracy Global.infer_img=./doc/imgs_en/img_12.jpg


The predicted image is saved in ./output/det_db/det_results/. It can be visualized with PIL and matplotlib as follows:


import matplotlib.pyplot as plt
# When using matplotlib.pyplot to draw in the notebook, add this command for display
%matplotlib inline
from PIL import Image
import numpy as np

img = Image.open('./output/det_db/det_results/img_12.jpg')
img = np.array(img)

# Draw the read picture
plt.figure(figsize=(20, 20))
plt.imshow(img)


4.2.4 Text Detection FAQ 


This section covers problems that developers often encounter when using PaddleOCR's text detection models, along
with solutions or suggestions.


The FAQ is divided into two parts:

* Questions related to model training in text detection

* Questions related to model inference in text detection


FAQ about Model Training in the Text Detection


1.1 What are the text detection algorithms provided by PaddleOCR? 


A: PaddleOCR contains a variety of text detection models, including regression-based text detection methods such as
EAST and SAST, and segmentation-based text detection methods such as DB and PSENet.


1.2 What datasets are used in the Chinese ultra-lightweight and generic models in the PaddleOCR project? How 
many samples are there? What is the configuration of GPUs? How many epochs were run, and how long did 
they run? 


A: For the ultra-lightweight DB detection model, the training data includes the open-source datasets LSVT, RCTW, CASIA,
CCPD, MSRA, MLT, BornDigit, iflytek, and SROIE, as well as synthetic datasets. The total data volume is 100,000 images.
The dataset is divided into 5 parts, and samples are randomly picked during training. The model was trained for about
500 epochs on 4 V100 GPUs, which took 3 days.


1.3 Does the text detection training label require the specific text content? What does the "###" in the label mean?


A: For model training, only the coordinates of the text regions are required. A label can have four or fourteen
points, arranged in the order of upper left, upper right, lower right, and lower left. The label file provided by
PaddleOCR contains text fields; for unclear text in a text region, ### is used instead. The text field in the label is
not used when training the text detection model.


1.4 The trained text detection model performs poorly when text lines are densely packed. What can be done?


A: When using segmentation-based methods such as DB to detect dense text lines, it is best to collect a batch of data
for training and to turn down [shrink_ratio]
(https://github.com/PaddlePaddle/PaddleOCR/blob/8b656a3e13631dfb1ac21d2095d4d4a4993ef710/ppocr/data/imaug/make_shrink_ma..., line 37),
which is used to generate the binary map during training. In addition, you can appropriately reduce unclip_ratio
during model inference, since the larger the unclip_ratio value is, the larger the detection box is.


1.5 For some large document images, DB tends to miss more text in detection. How can this problem be avoided?


A: First, determine whether the missed detections are caused by model training or by image prediction. If the model
is not well trained, add more training data or strengthen data augmentation during training. If the cause is an
oversized input image, increase the longest-side limit [det_limit_side_len]
(https://github.com/PaddlePaddle/PaddleOCR/blob/8b656a3e13631dfb1ac21d2095d4d4a4993ef710/tools/infer/utility.py#L47),
which is 960 by default. You can also visualize the post-processed segmentation map to check whether the missed text
is segmented. If it is not segmented, the model is not well trained. If a complete segmented region is present, the
missed detection is caused by post-processing. In this case, it is recommended to adjust the DB post-processing
parameters.


1.6 How to deal with missed detection of curved text (such as a slightly distorted document image) when using DB?


A: In DB post-processing, the average score of a text box is computed over its rectangular area, which easily causes
curved text to be missed. Computing the average score over the polygon area has therefore been added. It is more
accurate, but may be slower, so you can choose according to your needs. A [visualized comparison of the effect]
(https://github.com/PaddlePaddle/PaddleOCR/pull/2604) is available in the related PR. This function is selected with
the parameter det_db_score_mode, which accepts [fast (default), slow]: fast corresponds to the original rectangle
mode, and slow corresponds to the polygon mode. Thanks to the user buptlihang for submitting the PR that helped solve
this problem.


1.7 In simple OCR tasks with low accuracy requirements, how many datasets do I need to prepare? 


A: (1) The amount of training data depends on the complexity of the problem to be solved. The harder the problem and
the higher the required accuracy, the more data is needed. In general, more training data leads to better results.

(2) In scenarios with low accuracy requirements, detection and recognition need different amounts of data. For
detection, 500 images can reach a basic detection standard. For recognition, every character in the recognition
dictionary should appear in more than 200 text images from different scenes (for example, if there are 5 characters
in the dictionary and each character needs to appear in more than 200 images, then the minimum number of images is
between 200 and 1000), so that a basic recognition standard can be guaranteed.


1.8 How to get more data when the amount of training data is small? 


A: When the amount of training data is small, you can try the following three ways to get more data: (1) The most
direct and effective way is to collect more training data manually. (2) Perform basic image processing or
transformations with PIL and OpenCV, for example, drawing text onto a background with ImageFont, Image, and ImageDraw
in PIL, or applying OpenCV's rotation, affine transformation, and Gaussian filtering. (3) Synthesize data with data
generation algorithms such as pix2pix.


1.9 How to replace the backbone of text detection/recognition? 


A: In both text detection and text recognition, the choice of the backbone network is a trade-off between prediction
quality and efficiency. Generally, a larger backbone network such as ResNet101_vd makes detection or recognition more
accurate, but the inference time increases accordingly. A smaller backbone network such as MobileNetV3_small_x0_35
makes inference faster, but the accuracy of detection or recognition drops considerably. Fortunately, the detection or
recognition performance of different backbone networks is positively correlated with their performance on the
ImageNet 1000-class image classification task. PaddleClas, the image classification suite of PaddlePaddle, covers 23
series of classification network structures such as ResNet_vd, Res2Net, HRNet, MobileNetV3, and GhostNet. It provides
their top-1 accuracy on the above image classification task, their inference time on GPU (V100 and T4) and CPU
(Snapdragon 855), and the download addresses of the corresponding 117 pre-trained models.


(1) Replacing the text detection backbone network mainly involves identifying 4 stages similar to those of ResNet, so
that a detection head similar to FPN can be attached. In addition, for text detection, initializing from a
classification model pre-trained on ImageNet can accelerate convergence and improve the results.


(2) Replacing the backbone network for text recognition requires attention to where the network reduces width and
height. Since text recognition inputs generally have a large width-to-height ratio, the height should be reduced less
often than the width. You can refer to the changes made to the MobileNetV3 backbone network in PaddleOCR.


1.10 How to finetune the detection model, for example by freezing the earlier layers or using a smaller learning
rate for some layers?


A: To freeze certain layers, set the stop_gradient attribute of a variable to True, so that none of the parameters
used to compute this variable are updated. Refer to:
https://www.paddlepaddle.org.cn/documentation/docs/zh/develop/faq/train_cn.html#id4


Using a smaller learning rate for some layers is not very convenient in the static graph. One option is to set a
fixed learning rate in the weight attribute when the parameters are initialized. Refer to:
https://www.paddlepaddle.org.cn/documentation/docs/en/develop/api/paddle/fluid/param_attr/ParamAttr_cn.html#paramattr


In practice, our experiments show that loading the model and finetuning it directly, without setting different
learning rates for individual layers, also works.


1.11 In the preprocessing of DB, why should the length and width of the picture be set to multiples of 32? 


A: It is related to the downsampling stride of the network. Take the ResNet backbone network used for detection as an
example: after the image is input to the network, it is downsampled by a factor of 2 five times, for a total factor of
32. Therefore, it is recommended that the input image size be a multiple of 32.
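The arithmetic behind this recommendation can be sketched in a few lines. The helper names round_up_to_multiple and feature_map_size are hypothetical, and rounding up is only one possible choice made for this sketch; the resize operator in PaddleOCR may round differently.

```python
def round_up_to_multiple(length, stride=32):
    """Smallest multiple of stride that is >= length (and at least stride)."""
    return max(stride, ((length + stride - 1) // stride) * stride)


def feature_map_size(length, num_downsamples=5):
    """Side length after halving num_downsamples times (exact for multiples of 32)."""
    for _ in range(num_downsamples):
        length //= 2
    return length


for side in (640, 641, 720, 30):
    padded = round_up_to_multiple(side)
    print(side, '->', padded, '->', feature_map_size(padded))
```

When the side length is a multiple of 32, every one of the five halvings divides evenly, so the feature maps of the different stages align exactly when they are later upsampled and merged.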


1.12 In the PP-OCR series models, why does the backbone network for text detection not use SEBlock? 


A: The SE module is an important module of the MobileNetV3 network. It estimates the importance of each channel of
the feature map and assigns a weight to each channel accordingly, which improves the expressive ability of the
network. However, in text detection the input resolution is large, usually 640*640, so it is difficult for the SE
module to estimate the importance of each channel. The improvement it brings is limited, while the module is
relatively time-consuming. Therefore, in the PP-OCR system, the backbone network for text detection does not use the
SE module. Experiments also show that removing the SE module reduces the size of the ultra-lightweight model by 40%,
while the text detection quality is essentially unaffected. For more details, please refer to the PP-OCR technical
article, https://arxiv.org/abs/2009.09941.


1.13 How to optimize the PP-OCR detection effect if it is not good? 
A: It depends. 


If detection does not work well on your scene, the first priority is to fine-tune the model on your own data;


If the image is very large and the text is dense, do not over-compress the image. Try modifying the resize logic in the detection preprocessing so that the image is not shrunk too much;


If the detection box fits the text too tightly or is too large, adjust db_unclip_ratio. Increasing the parameter enlarges the detection box, and decreasing it shrinks the box;


If many boxes are missed in detection, lower the threshold parameter det_db_box_thresh so that fewer detection boxes are filtered out. You can also try setting det_db_score_mode to 'slow';


You can also set use_dilation to True to dilate the output feature map of the detector. In general, this improves the result.
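The effect of det_db_box_thresh can be pictured as a plain score filter. The sketch below is a simplified stand-in rather than PaddleOCR's post-processing code: the function `filter_boxes`, the box labels, and the scores are all hypothetical.

```python
def filter_boxes(boxes, scores, box_thresh):
    """Keep only the boxes whose confidence score reaches `box_thresh`."""
    return [box for box, score in zip(boxes, scores) if score >= box_thresh]


# Hypothetical candidate boxes and their confidence scores
boxes = ["box_a", "box_b", "box_c"]
scores = [0.85, 0.55, 0.35]

kept_strict = filter_boxes(boxes, scores, box_thresh=0.6)
kept_relaxed = filter_boxes(boxes, scores, box_thresh=0.5)
print(kept_strict, kept_relaxed)
```

Lowering the threshold keeps more low-confidence boxes, which reduces misses at the cost of possible false positives.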


1.14 What should I do if part of the text is missed in detection, as in the figure below?


[Figure legend: GT box; the probability map; predicted box]


Figure 7: Missing detection 


A: The figure shows that part of the text is detected, but because the IoU between the predicted box and the GT box is below the threshold of 0.5, the box is not counted as a match and the detection metrics do not reflect it. If there are many such cases, lower the IoU threshold. In addition, text is missed when its features produce no response; ultimately, the network has not learned the features of the missed text. Such cases are best analyzed one by one. Visualize the inference results to find out whether the miss is caused by lighting, deformation, long text, or other factors. Then improve the detection results through data augmentation, network adjustment, or post-processing adaptation.
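To see how a partially detected text line scores against its ground truth, the following sketch computes the IoU of two axis-aligned boxes. The function `box_iou` and the coordinates are hypothetical; real evaluation code typically works on polygons, so this is a simplified assumption.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


gt_box = (0, 0, 100, 20)    # full text line
pred_box = (0, 0, 40, 20)   # prediction covering only part of the line
iou = box_iou(gt_box, pred_box)
print(iou)
```

Here the prediction lies entirely inside the ground-truth box, so the IoU equals the fraction of the line that was covered.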
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FAQ about Model Inference in the Text Detection 
2.1 In the DB solution, some boxes fit the text so tightly that some corners of the text are not covered. Is there any solution?


A: Appropriately increase the post-processing parameter unclip_ratio. The larger the parameter is, the larger the text box 
will be. 
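The relation between unclip_ratio and box size can be sketched with the offset formula used by DB, where the expansion distance is area × ratio / perimeter. The sketch below assumes an axis-aligned rectangle for simplicity, whereas the real post-processing offsets arbitrary polygons; `unclip_distance` and `unclip_rect` are hypothetical helper names.

```python
def unclip_distance(width, height, unclip_ratio):
    """Expansion distance D = area * unclip_ratio / perimeter for a rectangle."""
    area = width * height
    perimeter = 2 * (width + height)
    return area * unclip_ratio / perimeter


def unclip_rect(box, unclip_ratio):
    """Expand an axis-aligned box (x1, y1, x2, y2) outward by the unclip distance."""
    x1, y1, x2, y2 = box
    d = unclip_distance(x2 - x1, y2 - y1, unclip_ratio)
    return (x1 - d, y1 - d, x2 + d, y2 + d)


shrunk_box = (0, 0, 100, 20)
print(unclip_rect(shrunk_box, 1.5))
print(unclip_rect(shrunk_box, 2.0))   # a larger ratio gives a larger box
```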


2.2 Why does inference in the PaddleOCR detection only support one image for testing? That is, 
test_batch_size_per_card=1. 


A: During inference, each image is scaled proportionally so that its longest side is 960. After scaling, the height and width differ from image to image, so the images cannot be stacked into a batch. For this reason, test_batch_size_per_card is set to 1.
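A small sketch makes the batching problem concrete. It assumes the simplest reading of the rule above (scale so that the longest side equals 960); the function `scale_longest_side` and the sample image sizes are hypothetical and do not reproduce PaddleOCR's actual resize operator.

```python
def scale_longest_side(height, width, limit_side_len=960):
    """Scale (height, width) proportionally so that the longest side equals limit_side_len."""
    ratio = limit_side_len / max(height, width)
    return int(round(height * ratio)), int(round(width * ratio))


shape_a = scale_longest_side(1080, 1920)   # landscape image
shape_b = scale_longest_side(1200, 800)    # portrait image
can_batch = shape_a == shape_b             # a batch needs identical shapes
print(shape_a, shape_b, can_batch)
```

Because the two scaled shapes differ, the images cannot share one tensor without extra padding.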


2.3 How to accelerate the model inference in PaddleOCR text detection on the CPU?


A: On x86 CPUs, mkldnn (OneDNN) can be used for acceleration. Turn on the enable_mkldnn parameter on CPUs that support mkldnn. In addition, increasing num_threads, the number of threads used for inference on the CPU, can speed up inference.


2.4 How to accelerate PaddleOCR’s text detection model prediction on GPU?


A: It is recommended to use TensorRT to accelerate inference on GPU.

1. Download the Paddle installation package or inference library with TensorRT from link.

2. Download TensorRT from the Nvidia official website. Note that the downloaded TensorRT version must be the same as the one compiled into the Paddle installation package.

3. Set the environment variable LD_LIBRARY_PATH to point to the lib folder of TensorRT:

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:<TensorRT-${version}/lib>

4. Enable the tensorrt option.


2.5 How to deploy the PaddleOCR model on the mobile terminal?


A: PaddlePaddle has a dedicated tool for mobile-device deployment, PaddleLite, and PaddleOCR provides DB+CRNN demo code for deployment on Android ARM. Refer to link.


2.6 How to use multi-process inference on PaddleOCR? 


A: PaddleOCR has recently added control parameters for multi-process inference. use_mp indicates whether to use multiple processes, and total_process_num is the number of processes. For detailed usage, please refer to document.
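One simple way to divide the work among total_process_num processes is to give each process every n-th image. The sketch below illustrates that idea only; `shard_images` and `process_id` are names assumed here for illustration, and the actual launch logic is described in the PaddleOCR document.

```python
def shard_images(image_list, process_id, total_process_num):
    """Return the images handled by one process: every n-th item, starting at process_id."""
    return image_list[process_id::total_process_num]


# Hypothetical file names
images = ["img_{}.jpg".format(i) for i in range(7)]
shards = [shard_images(images, pid, 3) for pid in range(3)]
for pid, shard in enumerate(shards):
    print(pid, shard)
```

Every image lands in exactly one shard, so the processes never repeat each other's work.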


2.7 How to solve video memory explosion and memory leak during inference? 


A: If you are running inference with the training model, the cause is that the video memory is insufficient for the model or the input image is too large. You can refer to the code and add paddle.no_grad() before the main function runs to reduce video memory usage. If the memory usage of the inference model is too high, add config.enable_memory_optim() to the configuration to reduce it.


In addition, to resolve memory leaks during inference, install the latest version of paddle, in which the memory leak has been fixed.
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4.2.5 Assignment 


Short-answer questions: 


1. According to the sizes of the output feature maps of the DB backbone and FPN, the height and width of the DB input image need to be multiples of: A: 32, B: 64


Experiment: 
1. Use the DB algorithm configuration file configs/det/det_mv3_db.yml to train the text detection model on the dataset det_data_lesson_demo.tar, and optimize the experimental accuracy.


Figure 8: Example of det_data_lesson_demo training data


4.3 Summary 


This chapter introduces the theory and practice of text detection algorithms. 


The first section introduces how to get started quickly with the PaddleOCR text detection model, and demonstrates the 
implementation process from data processing to text detection algorithm training with the example of the DB algorithm. 
The next section will talk about the text recognition algorithms. 


The second one introduces the development of text detection in recent years, including regression-based and segmentation- 
based text detection methods. Also, the methodology and ideas of some classic papers are listed. The next section will 
take the PaddleOCR open-source library as an example to detail how to construct text detection algorithms from scratch 
and implement the training. 




4.4 Reference 


1. Liao, Minghui, et al. “Textboxes: A fast text detector with a single deep neural network.” Thirty-first AAAI
conference on artificial intelligence. 2017.

2. Liao, Minghui, Baoguang Shi, and Xiang Bai. “Textboxes++: A single-shot oriented scene text detector.” IEEE
transactions on image processing 27.8 (2018): 3676-3690.

3. Tian, Zhi, et al. “Detecting text in natural image with connectionist text proposal network.” European conference
on computer vision. Springer, Cham, 2016.

4. Zhou, Xinyu, et al. “East: an efficient and accurate scene text detector.” Proceedings of the IEEE conference on
Computer Vision and Pattern Recognition. 2017.

5. Wang, Fangfang, et al. “Geometry-aware scene text detection with instance transformation network.” Proceedings
of the IEEE Conference on Computer Vision and Pattern Recognition. 2018.

6. Yuliang, Liu, et al. “Detecting curve text in the wild: New dataset and new solution.” arXiv preprint
arXiv:1712.02170 (2017).

7. Deng, Dan, et al. “Pixellink: Detecting scene text via instance segmentation.” Proceedings of the AAAI Conference
on Artificial Intelligence. Vol. 32. No. 1. 2018.

8. Wu, Yue, and Prem Natarajan. “Self-organized text detection with minimal post-processing via border learning.”
Proceedings of the IEEE International Conference on Computer Vision. 2017.

9. Tian, Zhuotao, et al. “Learning shape-aware embedding for scene text detection.” Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition. 2019.


10. Wang, Wenhai, et al. “Shape robust text detection with progressive scale expansion network.” Proceedings of the 
IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019. 

11. Wang, Wenhai, et al. “Efficient and accurate arbitrary-shaped text detection with pixel aggregation network.” 
Proceedings of the IEEE/CVF International Conference on Computer Vision. 2019. 

12. Liao, Minghui, et al. “Real-time scene text detection with differentiable binarization.” Proceedings of the AAAI 
Conference on Artificial Intelligence. Vol. 34. No. 07. 2020. 

13. Hochreiter, Sepp, and Jürgen Schmidhuber. “Long short-term memory.” Neural computation 9.8 (1997): 1735-1780.

14. Dai, Pengwen, et al. “Progressive Contour Regression for Arbitrary-Shape Scene Text Detection.” Proceedings of 
the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021. 

15. He, Minghang, et al. “MOST: A Multi-Oriented Scene Text Detector with Localization Refinement.” Proceedings 
of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021. 

16. Zhu, Yiqin, et al. “Fourier contour embedding for arbitrary-shaped text detection.” Proceedings of the IEEE/CVF 
Conference on Computer Vision and Pattern Recognition. 2021. 

17. Tang, Jun, et al. “Seglink++: Detecting dense and arbitrary-shaped scene text by instance-aware component grouping.”
Pattern recognition 96 (2019): 106954.

18. Wang, Yuxin, et al. “Contournet: Taking a further step toward accurate arbitrary-shaped scene text detection.” 
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020. 

19. Zhang, Chengquan, et al. “Look more than once: An accurate detector for text of arbitrary shapes.” Proceedings 
of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019. 

20. Xue C, Lu S, Zhang W. Msr: Multi-scale shape regression for scene text detection[J]. arXiv preprint 
arXiv:1901.02596, 2019. 



CHAPTER 
FIVE 


TEXT RECOGNITION 


This chapter covers the theory of text recognition algorithms, including the background, the categories of algorithms, and the ideas behind some classic papers.


In this chapter, you can learn: 
1. The goal of text recognition 
2. Types of text recognition algorithms 
3. Typical ideas of various algorithms 


Text recognition is a subtask of OCR (Optical Character Recognition), aimed at recognizing the content of one specific area. In the two-stage OCR method, it comes after text detection and converts an image region into text.


Specifically, the model takes a localized text line as input and predicts its content and a confidence value. The visualization results are as follows:


Predicts of word_2.jpg: ['MiB MAE HIGEK25-265', 0.9957605]


Predicts of word_3004.jpg: ['R2eBit', 0.99971426]


Figure 1: Visualization results of model prediction 


There are many application scenarios for text recognition, such as document recognition, road sign recognition, license plate recognition, industrial number recognition, etc. In actual scenarios, the task of text recognition can be divided into two categories: regular text recognition and irregular text recognition.


• Regular text recognition: applies to text that is mostly horizontal, such as printed fonts and scanned text.


• Irregular text recognition: common in natural scenes, where text is often not horizontal and may be distorted, occluded, or blurred. Its appearance varies widely in curvature, orientation, and distortion.


The figures below show the data patterns of IC15 and IC13, which respectively represent the irregular text and the regular 
text. Obviously, the irregular one often has problems such as distortion, blurring, and large font differences. It is closer 
to the natural scene but more challenging. 


Therefore, the current major algorithms are trying to process irregular datasets with higher precision. 




Figure 2: IC15 picture sample (irregular texts) 




Figure 3: IC13 picture sample (regular texts) 


The two public data sets are often involved in the comparison of algorithm capabilities. After multi-dimensional analysis, 
the common classification of English benchmark datasets is as follows: 


[Diagram labels: Text Recognition Datasets; Regular text; IIIT5K]


Figure 4: Common English benchmark datasets


5.1 Introduction to Text Recognition Methods


In the traditional text recognition method, the task is divided into 3 steps: image preprocessing, character segmentation 
and character recognition. In this way, it is necessary to model a specific scene, but the model will fail once the scene 
changes. In the face of complex text backgrounds and scene changes, methods based on deep learning can perform better. 


Most recognition algorithms can be represented by the following framework, in which every algorithm flow is divided into 4 stages, starting from the input image X:


1. Spatial Transformation Module: rectifies the image, commonly with STN or TPS. Output: normalized image.

2. Vision Feature Extraction Module: extracts image features with networks such as VGG, RCNN, ResNet, etc. Output: visual feature V.

3. Sequence Feature Extraction Module: generates a sequence of characters with models such as BiLSTM, Transformer, etc. Output: contextual feature H.

4. Predict Module: converts the per-frame predictions made by the RNN into a label sequence, for example with CTC or Attention. Output: prediction Y.

We have sorted out the main algorithm types and major papers as below:

Algorithm type | Main ideas | Main papers
Traditional algorithm | Sliding window, character extraction, dynamic programming | -
CTC | Based on CTC, faster recognition can be realized without predefined alignment. | CRNN, Rosetta
Attention | Based on attention, the method can be applied to unconventional text. | RARE, DAN, PREN
Transformer | Transformer-based method | SRN, NRTR, Master, ABINet
Rectification | The rectification module learns the text boundary and corrects it along the horizontal axis | RARE, ASTER, SAR
Segmentation | The segmentation-based approach extracts the character positions and then classifies the text | Text Scanner, Mask TextSpotter


5.1.1 Regular Text Recognition 


There are two major algorithms for text recognition, namely the CTC (Connectionist Temporal Classification)-based 
algorithm and the Sequence2Sequence algorithm. They mainly differ in decoding. 


The CTC-based algorithm feeds the encoded sequence into CTC for decoding; the Sequence2Sequence-based method feeds the sequence into a Recurrent Neural Network (RNN) module for step-by-step decoding. Both methods have been verified to be effective, and they are the mainstream approaches.


Figure 5: Left: CTC-based method, right: Sequence2Sequence-based method


Algorithms Based on CTC 


The most typical CTC-based algorithm is CRNN (Convolutional Recurrent Neural Network) [1], which uses mainstream convolutional structures such as ResNet, MobileNet, VGG, etc. to extract features. Text recognition is a special task in that the input data carries a large amount of contextual information. The convolution kernels of a convolutional neural network (CNN) focus on local information and cannot model long-term dependencies, so it is difficult to capture contextual connections with a CNN alone. To solve this problem, the CRNN text recognition algorithm introduces a bidirectional LSTM (Long Short-Term Memory) to enhance context modeling. Experiments have shown that the bidirectional LSTM module can effectively extract the contextual information of the picture. The output feature sequence is finally fed into the CTC module, which decodes the sequential result. This structure has been validated and widely used in text recognition. Rosetta [2] is a recognition network proposed by Facebook, consisting of a fully convolutional model and CTC. Gao et al. [3] replace the LSTM with CNN convolutions, which gives fewer parameters and better performance at the same accuracy.


Figure 6: CRNN structure diagram


Sequence2Sequence Algorithms 


In the Sequence2Sequence algorithms, the Encoder encodes the whole input sequence into a unified semantic vector, which is then decoded by the Decoder. During decoding, the output at time step t-1 is used as the input of time step t, and decoding continues in a loop until the stop character is output. The encoder is typically an RNN. For each input word, the encoder outputs a vector and a hidden state, and passes the hidden state on to the next input word, obtaining the semantic vector in a loop. The decoder is another RNN, which receives the output vectors of the encoder and outputs a series of words to complete the transformation. Inspired by Sequence2Sequence in translation, Shi [4] proposed an attention-based encoder-decoder framework for text recognition. In this way, the RNN can learn the character-level language model hidden in strings from the training data.
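The decoding loop described above can be sketched in a few lines. This is a toy illustration, not a trained model: `toy_step` is a hypothetical stand-in for one RNN decoder step that simply spells a fixed word, and the token names `<sos>` and `<eos>` are assumptions.

```python
def greedy_decode(step_fn, start_token, stop_token, max_steps=20):
    """Feed the output of step t-1 back in as the input of step t until the stop token appears."""
    outputs = []
    prev_token, state = start_token, None
    for _ in range(max_steps):
        token, state = step_fn(prev_token, state)
        if token == stop_token:
            break
        outputs.append(token)
        prev_token = token
    return outputs


def toy_step(prev_token, state):
    """Hypothetical decoder step: ignores prev_token and emits the next letter of a fixed word."""
    word = "SLOW"
    index = 0 if state is None else state
    token = word[index] if index < len(word) else "<eos>"
    return token, index + 1


decoded = "".join(greedy_decode(toy_step, "<sos>", "<eos>"))
print(decoded)
```

In a real decoder, the step function would use both the previous token and the hidden state to predict the next character.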


Figure 7: Sequence2Sequence structure diagram


Both kinds of algorithms perform well on regular text, but, limited by their network design, they struggle with curved or oriented irregular text. To solve such problems, researchers have proposed improved algorithms based on the above two.
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5.1.2 Irregular Text Recognition 


• Irregular text recognition algorithms can be divided into 4 categories: Rectification-based methods, Attention-based methods, Segmentation-based methods, and Transformer-based methods.


Rectification-based Methods 


The Rectification-based method uses visual transformation modules to convert irregular text into regular text as far as possible, and then applies conventional recognition methods.


The RARE [4] model was the first to propose a rectification scheme for irregular text. The network has two main parts: an STN (Spatial Transformer Network) and a recognition network based on Sequence2Sequence. The STN is the rectification module. An irregular text image enters the STN and is turned into a horizontal image through a TPS (Thin-Plate-Spline) transformation. This transformation can rectify curved and perspective-distorted text to some extent. The rectified image is then sent to the sequence recognition network for decoding.


[Diagram: Input Image → Spatial Transformer Network → Rectified Image → Sequence Recognition Network]


Figure 8: RARE structure diagram


The RARE paper points out that this method has greater advantages on irregular text datasets. On the two datasets CUTE80 and SVTP, it scores more than 5 percentage points higher than CRNN, proving that the rectification module works. Based on this, [6] also combines an STN with a text recognition system based on an attention sequence recognition network.


The rectification-based method is flexible to transfer to other recognizers. In addition to attention-based methods such as RARE, STAR-Net [5] applies the rectification module to a CTC-based algorithm, which improves considerably over CRNN.


Attention-based Methods


The attention-based method concentrates on the correlation between sequences. This method was first proposed in machine translation, where the translation result is considered to be affected mainly by certain words, so decisive words should be given more weight. The same goes for text recognition. When decoding the encoded sequence, each step selects the appropriate context to generate the next state, which yields more accurate results.
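As a minimal sketch of this idea, the code below computes attention weights for one decoding step and the resulting context vector. It uses plain dot-product scoring as a simplifying assumption (the papers discussed here may use other scoring functions), and the function names and numbers are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, encoder_states):
    """Weight each encoder state by its dot-product similarity to the decoder query."""
    scores = [sum(q * k for q, k in zip(query, state)) for state in encoder_states]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    context = [sum(w * state[i] for w, state in zip(weights, encoder_states))
               for i in range(dim)]
    return weights, context


# Three hypothetical encoded frames and one decoder query
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, context = attend([2.0, 0.0], encoder_states)
print(weights, context)
```

Frames that match the query receive larger weights, so they dominate the context vector used for the next prediction.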


R²AM [7] was the first to introduce attention into text recognition. The model first extracts the encoded image features from the input image through recursive convolutional layers, and then uses implicitly learned character-level linguistic statistics to decode the output characters through a recurrent neural network. In the decoding process, the attention mechanism is introduced to make better use of the image features through soft feature selection. This selective processing is closer to human intuition.


Figure 9: R²AM structure diagram


Many algorithms have explored and refined the attention approach. For example, SAR [8] extends 1D attention to 2D attention. The RARE model mentioned above is also based on attention. Experiments have also shown that the attention-based method achieves higher accuracy than the CTC method.


Figure 10: Attention diagram


Segmentation-based Methods 


The Segmentation-based method treats each character of a text line as an individual unit, since recognizing segmented characters is easier than recognizing an entire rectified text line. It tries to locate each character in the input text image and uses a character classifier to obtain the recognition result. This simplifies the complex global problem into a local one, and it works well in irregular text scenes. However, the method requires character-level labeling, which is relatively difficult to obtain. Lyu et al. [9] propose an instance segmentation model to recognize words, which adopts an FCN (Fully Convolutional Network) based method in its recognition part. In [10], a character attention FCN is designed to solve the problem of text recognition from a two-dimensional perspective. When the text is curved or severely distorted, this method can perform better in positioning both regular and irregular texts.


Figure 11: Mask TextSpotter structure diagram


Transformer-based Methods 


With the rapid development of the transformer, its effectiveness in visual tasks has been verified in classification and detection. As mentioned in the regular text recognition section, a CNN has limitations in modeling long-term dependencies. The transformer structure can solve this problem. It can attend to global information in the feature extractor and replace additional context modeling modules (LSTM).


Some text recognition algorithms use the transformer’s encoder structure together with convolution to extract sequence features. The encoder is composed of multiple stacked blocks, each made of a MultiHeadAttention layer and a Positionwise Feedforward layer. The self-attention in MultiHeadAttention uses matrix multiplication in place of the step-by-step time-series computation of an RNN, freeing it from the RNN’s sequential dependence between time steps. There are also algorithms that use the transformer’s decoder for decoding. It can obtain stronger semantic information than an RNN and is more efficient thanks to parallel computing.
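The following sketch shows how self-attention handles all positions at once with matrix multiplication instead of a step-by-step loop. It is deliberately simplified: it uses a single head and identity projections (Q = K = V = x), whereas a real MultiHeadAttention layer has learned projection matrices; all function names and values are hypothetical.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax_rows(matrix):
    """Apply a numerically stable softmax to every row."""
    result = []
    for row in matrix:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        result.append([e / total for e in exps])
    return result


def self_attention(x):
    """Scaled dot-product self-attention with identity projections (Q = K = V = x)."""
    dim = len(x[0])
    keys_t = [list(col) for col in zip(*x)]              # transpose of x
    scores = matmul(x, keys_t)                           # all pairwise similarities at once
    scaled = [[v / math.sqrt(dim) for v in row] for row in scores]
    weights = softmax_rows(scaled)
    return matmul(weights, x), weights


# Three sequence positions with two features each
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
output, weights = self_attention(x)
print(output)
```

Every output row depends on every input position, and no row has to wait for the previous one to be computed.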


The SRN [11] algorithm connects the transformer’s encoder to ResNet50 to enhance the 2D visual features. It also puts forward a parallel attention module, which uses the reading order as the query so that the calculation is independent of time, and outputs the aligned visual features of all time steps in parallel. In addition, SRN uses the transformer’s encoder as a semantic module to integrate the visual and semantic information of the picture, which makes it more effective on occluded and blurred irregular text.


NRTR [12] uses a complete transformer structure to encode and decode the input picture, and only uses a few simple convolutional layers for high-level feature extraction. This shows that the transformer structure is effective in text recognition.


SRACN [13] uses the transformer’s decoder to replace the LSTM, which once again verifies that parallel training is efficient and accurate.


5.2 Practice of a Text Recognition CRNN 


The last section introduced the main methods of text recognition. CRNN is one of the earliest of these methods and among the most widely used in industry. This section elaborates on how to build, train, evaluate and predict with the CRNN text recognition model based on PaddleOCR. The dataset is icdar 2015, which contains 4468 training samples and 2077 test samples.


In this section, you can learn: 
1. How to use PaddleOCR whl packages to quickly predict text recognition,
2. The basic principles and network structure of CRNN,
3. The necessary steps and parameter adjustment methods of model training,
4. How to use a custom dataset to train the network.


5.2.1 Quick Start


Installation of Related Dependencies and WHL Packages 


First, confirm that paddle and paddleocr are installed. If it is done, skip this step. 


# Install PaddlePaddle GPU version
!pip install paddlepaddle-gpu

# Install PaddleOCR whl package
!pip install -U pip
!pip install paddleocr


Quick Prediction of Content 


The PaddleOCR whl package will automatically download the ppocr lightweight model as the default model. 
The following shows how to use the whl package for recognition prediction: 


Test picture: an image of the word “SLOW”.


from paddleocr import PaddleOCR


ocr = PaddleOCR()  # need to run only once to download and load model into memory
img_path = '/home/aistudio/work/word_19.png'
result = ocr.ocr(img_path, det=False)


for line in result:
    print(line)


After executing the above code, the recognition result and recognition confidence will be returned.


("SLOW", 0.9776376)


So far, you have learned how to use the PaddleOCR whl package to make predictions. There are more test pictures in the ./work/ path, so you can try other pictures.


5.2.2 Detailed Implementation of CRNN Text Recognition Algorithm 


In Section 5.2.1, paddleocr loaded the trained CRNN recognition model for prediction. This section introduces the principles and process of CRNN in detail.
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Category 


CRNN is a CTC-based algorithm. Its position in the classification diagram presented in the theory part is as follows. It can be seen that CRNN is mainly used to deal with regular texts, and that the CTC-based algorithm is faster in prediction and suits long texts. Therefore, CRNN is chosen by PP-OCR to recognize Chinese characters.


Detailed Introduction of the Algorithm


CRNN’s network structure is shown below. From bottom to top, it has three parts: convolutional layers, recurrent layers, and a transcription layer:


1. Backbone:


As the underlying backbone network, the convolutional network extracts a feature sequence from the input image. Since convolution, max-pooling, element-wise operations and activation functions all act on local areas, they are translation invariant. Therefore, each column of the feature map corresponds to a rectangular area (called a receptive field) of the original image, and these rectangular areas are in the same left-to-right order as their corresponding columns on the feature map. Because a CNN needs to scale the input image to meet its fixed input dimensionality, it is not suitable for sequences that vary greatly in length. To better support variable-length sequences, CRNN sends the feature vectors output by the last layer of the backbone to the RNN layer, which converts them into a sequence feature.


[Figure labels: Feature Sequence; Receptive field]


2. Neck:


Built on top of the convolutional network, the recurrent layer forms a recurrent network that converts image features into sequence features and predicts the label distribution of each frame. First, an RNN is very good at capturing the context information of a sequence. In image-based sequence recognition, contextual cues are more useful than processing each pixel independently. Taking scene text recognition as an example, a wide character may need several consecutive frames to be fully described. In addition, some ambiguous characters are easier to distinguish by observing their context. Second, an RNN can propagate errors back to the convolutional layers, so that the whole network can be trained jointly. Third, an RNN can operate on sequences of any length, which handles text images of growing length. CRNN uses a two-layer LSTM as the recurrent layer to mitigate vanishing and exploding gradients when training on long sequences.


3. Head:


The transcription layer converts the prediction of each frame into the final label sequence through a fully connected network and the softmax activation function. Finally, CTC loss is used to train the CNN and RNN jointly without requiring sequence alignment. CTC has a special mechanism for merging sequences. After the LSTM outputs the sequence, each time step is classified to obtain the prediction result. Several time steps may correspond to the same category, so identical consecutive results need to be merged. To avoid merging characters that are genuinely repeated, CTC introduces a blank character that is inserted between the repeated ones.
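The merging rule can be illustrated with a greedy CTC decoder: collapse consecutive repeats, then drop blanks. This is a minimal sketch that assumes the per-frame labels have already been chosen (for example by argmax); the function name and the use of "-" as the blank symbol are illustrative choices.

```python
def ctc_greedy_decode(frame_labels, blank="-"):
    """Collapse consecutive repeated labels, then remove blank symbols."""
    decoded = []
    previous = None
    for label in frame_labels:
        # Keep a label only when it differs from the previous frame and is not blank
        if label != previous and label != blank:
            decoded.append(label)
        previous = label
    return "".join(decoded)


with_blank = ctc_greedy_decode("hh-e-ll-lo-")
without_blank = ctc_greedy_decode("hhelllo")
print(with_blank, without_blank)
```

The blank between the two runs of "l" is what preserves the double letter; without it, the repeats collapse into a single character.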
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Code Example 


The entire network structure is concise, and the code implementation is relatively simple. Modules can be built in sequence following the prediction process. In this section, four things need to be done: data input, backbone building, neck building, and head building.


[Data Input]


The data needs to be scaled to a uniform size (3,32,320) and normalized before being sent to the network. The data augmentation required in training is omitted here for brevity, and the necessary preprocessing steps are illustrated on a single image (source code location):


import cv2 
import math 
import numpy as np 


def resize_norm_img(img): 
mone 
Data scaling and normalization 
iparam img: input picture 


mon 


# Default input size 


imgC = 3 
imgH = 32 
imgW = 320 


# The real height and width of the picture 
h, w = img.shape[:2] 
# Picture real aspect ratio 


(continues on next page) 
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(continued from previous page) 


Bacio — we tilkoarce (in) 


# Scale 
if math.ceil(imgH * ratio) > imgW: 
# If greater than the default width, the width is imgwW 
resized_w = imgW 
else: 
# If it is smaller than the default width, the actual width of the picture. 
«Shall prevail 
resized_w = int(math.ceil(imgH * ratio) ) 


# Zoom in and out 

resized_image = cv2.resize(img, (resized_w, imgH) ) 
resized_image = resized_image.astype('float32') 

# Normalize 

resized_image = resized_image.transpose((2, 0, 1)) / 255 
resized_image -= 0.5 

resized_image /= 0.5 

# For positions with insufficient width, add 0 


padding_im = np.zeros((imgC, imgH, imgW), dtype=np.float32) 
padding_im[:, :, O:resized_w] = resized_imag 

# Transpose the image after padding for visualization 
draw_img = padding_im.transpose((1,2,0) ) 

return padding_im, draw_img 


import matplotlib.pyplot as plt

# Read the picture
raw_img = cv2.imread("/home/aistudio/work/word_1.png")
plt.figure()
plt.subplot(2, 1, 1)
# Visualize the original image
plt.imshow(raw_img)
# Scale and normalize
padding_im, draw_img = resize_norm_img(raw_img)
plt.subplot(2, 1, 2)
# Visualize the network input image
plt.imshow(draw_img)
plt.show()
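The width rule applied during scaling can be checked without any image library. The following is a minimal sketch under stated assumptions: resized_width is a hypothetical helper (not part of PaddleOCR) that only reproduces the aspect-ratio logic for the 32 x 320 target size.

```python
import math


def resized_width(h, w, img_h=32, img_w=320):
    """Width of an h x w image after scaling its height to img_h.

    The result is capped at img_w; narrower results are later zero-padded
    on the right up to img_w. (Hypothetical helper for illustration.)
    """
    ratio = w / float(h)
    scaled_w = int(math.ceil(img_h * ratio))
    return min(scaled_w, img_w)


# A 48 x 120 crop keeps its aspect ratio: 32 * 2.5 = 80 columns, 240 are padded.
print(resized_width(48, 120))
# A very wide 64 x 1000 crop would need 500 columns, so it is squeezed to 320.
print(resized_width(64, 1000))
```

Capping the width means very long text lines are compressed horizontally, which is one reason long text needs special handling.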


[Network Structure]

• Backbone

PaddleOCR adopts MobileNetV3 as the backbone network, and the modules are assembled in the same order as the network structure. First, define the common modules of the network (source code location): ConvBNLayer, SEModule, ResidualUnit, and make_divisible.


import paddle
import paddle.nn as nn
import paddle.nn.functional as F


class ConvBNLayer(nn.Layer):
    def __init__(self,
                 in_channels,
                 out_channels,
                 kernel_size,
                 stride,
                 padding,
                 groups=1,
                 if_act=True,
                 act=None):
        """
        Convolutional BN Layer
        :param in_channels: number of input channels
        :param out_channels: number of output channels
        :param kernel_size: convolution kernel size
        :param stride: stride size
        :param padding: padding size
        :param groups: the number of groups of the two-dimensional convolutional layer
        :param if_act: whether to add activation function
        :param act: activation function
        """
        super(ConvBNLayer, self).__init__()
        self.if_act = if_act
        self.act = act
        self.conv = nn.Conv2D(
            in_channels=in_channels,
            out_channels=out_channels,
            kernel_size=kernel_size,
            stride=stride,
            padding=padding,
            groups=groups,
            bias_attr=False)

        self.bn = nn.BatchNorm(num_channels=out_channels, act=None)

    def forward(self, x):
        # conv layer
        x = self.conv(x)
        # batchnorm layer
        x = self.bn(x)
        # whether to use the activation function
        if self.if_act:
            if self.act == "relu":
                x = F.relu(x)
            elif self.act == "hardswish":
                x = F.hardswish(x)
            else:
                print("The activation function({}) is selected incorrectly.".
                      format(self.act))
                exit()
        return x


class SEModule(nn.Layer):
    def __init__(self, in_channels, reduction=4):
        """
        SE module
        :param in_channels: number of input channels
        :param reduction: channel reduction ratio
        """
        super(SEModule, self).__init__()
        self.avg_pool = nn.AdaptiveAvgPool2D(1)
        self.conv1 = nn.Conv2D(
            in_channels=in_channels,
            out_channels=in_channels // reduction,
            kernel_size=1,
            stride=1,
            padding=0)
        self.conv2 = nn.Conv2D(
            in_channels=in_channels // reduction,
            out_channels=in_channels,
            kernel_size=1,
            stride=1,
            padding=0)

    def forward(self, inputs):
        # Average pooling
        outputs = self.avg_pool(inputs)
        # First convolutional layer
        outputs = self.conv1(outputs)
        # Relu activation function
        outputs = F.relu(outputs)
        # Second convolutional layer
        outputs = self.conv2(outputs)
        # Hardsigmoid activation function
        outputs = F.hardsigmoid(outputs, slope=0.2, offset=0.5)
        return inputs * outputs


class ResidualUnit(nn.Layer):
    def __init__(self,
                 in_channels,
                 mid_channels,
                 out_channels,
                 kernel_size,
                 stride,
                 use_se,
                 act=None):
        """
        Residual layer
        :param in_channels: number of input channels
        :param mid_channels: number of intermediate channels
        :param out_channels: number of output channels
        :param kernel_size: convolution kernel size
        :param stride: stride size
        :param use_se: whether to use se module
        :param act: activation function
        """
        super(ResidualUnit, self).__init__()
        self.if_shortcut = stride == 1 and in_channels == out_channels
        self.if_se = use_se

        self.expand_conv = ConvBNLayer(
            in_channels=in_channels,
            out_channels=mid_channels,
            kernel_size=1,
            stride=1,
            padding=0,
            if_act=True,
            act=act)
        self.bottleneck_conv = ConvBNLayer(
            in_channels=mid_channels,
            out_channels=mid_channels,
            kernel_size=kernel_size,
            stride=stride,
            padding=int((kernel_size - 1) // 2),
            groups=mid_channels,
            if_act=True,
            act=act)
        if self.if_se:
            self.mid_se = SEModule(mid_channels)
        self.linear_conv = ConvBNLayer(
            in_channels=mid_channels,
            out_channels=out_channels,
            kernel_size=1,
            stride=1,
            padding=0,
            if_act=False,
            act=None)

    def forward(self, inputs):
        x = self.expand_conv(inputs)
        x = self.bottleneck_conv(x)
        if self.if_se:
            x = self.mid_se(x)
        x = self.linear_conv(x)
        if self.if_shortcut:
            x = paddle.add(inputs, x)
        return x


def make_divisible(v, divisor=8, min_value=None):
    """
    Make sure to be divisible by 8
    """
    if min_value is None:
        min_value = divisor
    new_v = max(min_value, int(v + divisor / 2) // divisor * divisor)
    if new_v < 0.9 * v:
        new_v += divisor
    return new_v
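To see how make_divisible interacts with the scale factor, the sketch below repeats the function in plain Python and applies it to a few channel counts at scale=0.5. The channel counts are picked from the MobileNetV3-small configuration purely for illustration.

```python
def make_divisible(v, divisor=8, min_value=None):
    """Round v to a multiple of divisor without dropping more than 10%."""
    if min_value is None:
        min_value = divisor
    new_v = max(min_value, int(v + divisor / 2) // divisor * divisor)
    if new_v < 0.9 * v:
        new_v += divisor
    return new_v


scale = 0.5
for channels in (16, 72, 88, 576):
    # e.g. 72 * 0.5 = 36 is rounded to 40, the nearest multiple of 8
    print(channels, "->", make_divisible(scale * channels))
```

With scale=0.5 the last entry becomes 288, which is the number of output channels of the backbone.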


Use public modules to build backbone networks: 


class MobileNetV3(nn.Layer):
    def __init__(self,
                 in_channels=3,
                 model_name='small',
                 scale=0.5,
                 small_stride=None,
                 disable_se=False,
                 **kwargs):
        super(MobileNetV3, self).__init__()
        self.disable_se = disable_se

        small_stride = [1, 2, 2, 2]

        if model_name == "small":
            cfg = [
                # k, exp, c, se, nl, s
                [3, 16, 16, True, 'relu', (small_stride[0], 1)],
                [3, 72, 24, False, 'relu', (small_stride[1], 1)],
                [3, 88, 24, False, 'relu', 1],
                [5, 96, 40, True, 'hardswish', (small_stride[2], 1)],
                [5, 240, 40, True, 'hardswish', 1],
                [5, 240, 40, True, 'hardswish', 1],
                [5, 120, 48, True, 'hardswish', 1],
                [5, 144, 48, True, 'hardswish', 1],
                [5, 288, 96, True, 'hardswish', (small_stride[3], 1)],
                [5, 576, 96, True, 'hardswish', 1],
                [5, 576, 96, True, 'hardswish', 1],
            ]
            cls_ch_squeeze = 576
        else:
            raise NotImplementedError("mode[" + model_name +
                                      "_model] is not implemented!")

        supported_scale = [0.35, 0.5, 0.75, 1.0, 1.25]
        assert scale in supported_scale, \
            "supported scales are {} but input scale is {}".format(
                supported_scale, scale)

        inplanes = 16
        # conv1
        self.conv1 = ConvBNLayer(
            in_channels=in_channels,
            out_channels=make_divisible(inplanes * scale),
            kernel_size=3,
            stride=2,
            padding=1,
            groups=1,
            if_act=True,
            act='hardswish')
        i = 0
        block_list = []
        inplanes = make_divisible(inplanes * scale)
        for (k, exp, c, se, nl, s) in cfg:
            se = se and not self.disable_se
            block_list.append(
                ResidualUnit(
                    in_channels=inplanes,
                    mid_channels=make_divisible(scale * exp),
                    out_channels=make_divisible(scale * c),
                    kernel_size=k,
                    stride=s,
                    use_se=se,
                    act=nl))
            inplanes = make_divisible(scale * c)
            i += 1
        self.blocks = nn.Sequential(*block_list)

        self.conv2 = ConvBNLayer(
            in_channels=inplanes,
            out_channels=make_divisible(scale * cls_ch_squeeze),
            kernel_size=1,
            stride=1,
            padding=0,
            groups=1,
            if_act=True,
            act='hardswish')

        self.pool = nn.MaxPool2D(kernel_size=2, stride=2, padding=0)
        self.out_channels = make_divisible(scale * cls_ch_squeeze)

    def forward(self, x):
        x = self.conv1(x)
        x = self.blocks(x)
        x = self.conv2(x)
        x = self.pool(x)
        return x


This completes the definition of the backbone network. The entire network structure can be visualized with paddle.summary:


# Define the network input shape
IMAGE_SHAPE_C = 3
IMAGE_SHAPE_H = 32
IMAGE_SHAPE_W = 320

# Visualize the network structure
paddle.summary(MobileNetV3(), [(1, IMAGE_SHAPE_C, IMAGE_SHAPE_H, IMAGE_SHAPE_W)])

# Feed the picture into the backbone network
backbone = MobileNetV3()
# Convert numpy data to Tensor
input_data = paddle.to_tensor([padding_im])
# Output of the backbone network
feature = backbone(input_data)
# View the dimensions of the feature map
print("backbone output:", feature.shape)
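The spatial size of the backbone output can also be derived by hand. The sketch below is an illustration under stated assumptions: it uses the standard output-size formula for convolution and pooling, and it assumes the height strides 1, 2, 2, 2 of the small configuration in the listing (width stride 1 in every residual unit). conv_out and trace_backbone are hypothetical helper names.

```python
def conv_out(size, kernel, stride, padding):
    """Output length of a convolution or pooling layer along one axis."""
    return (size + 2 * padding - kernel) // stride + 1


def trace_backbone(h=32, w=320):
    # conv1: 3x3 kernel, stride 2, padding 1
    h, w = conv_out(h, 3, 2, 1), conv_out(w, 3, 2, 1)
    # (kernel size, height stride) of the 11 residual units;
    # the width stride is 1 in every unit (assumption read from the listing)
    blocks = [(3, 1), (3, 2), (3, 1), (5, 2), (5, 1), (5, 1),
              (5, 1), (5, 1), (5, 2), (5, 1), (5, 1)]
    for k, s in blocks:
        pad = (k - 1) // 2
        h, w = conv_out(h, k, s, pad), conv_out(w, k, 1, pad)
    # conv2 is 1x1 and keeps the size; the final 2x2 max pooling halves it
    h, w = conv_out(h, 2, 2, 0), conv_out(w, 2, 2, 0)
    return h, w


print(trace_backbone())  # (1, 80)
```

A height of 1 is what the neck expects, and the width of 80 is the number of time steps seen by the sequence model.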


• Neck

The neck converts the visual feature map output by the backbone into a sequence of one-dimensional feature vectors, feeds it to the LSTM network, and outputs the sequence features (source code location):


class Im2Seq(nn.Layer):
    def __init__(self, in_channels, **kwargs):
        """
        The image feature is converted into the sequence feature
        :param in_channels: number of input channels
        """
        super().__init__()
        self.out_channels = in_channels

    def forward(self, x):
        B, C, H, W = x.shape
        assert H == 1
        x = x.squeeze(axis=2)
        x = x.transpose([0, 2, 1])  # (NWC) (batch, width, channels)
        return x


class EncoderWithRNN(nn.Layer):
    def __init__(self, in_channels, hidden_size):
        super(EncoderWithRNN, self).__init__()
        self.out_channels = hidden_size * 2
        self.lstm = nn.LSTM(
            in_channels, hidden_size, direction='bidirectional', num_layers=2)

    def forward(self, x):
        x, _ = self.lstm(x)
        return x


class SequenceEncoder(nn.Layer):
    def __init__(self, in_channels, hidden_size=48, **kwargs):
        """
        Sequence encoding
        :param in_channels: number of input channels
        :param hidden_size: hidden layer size
        """
        super(SequenceEncoder, self).__init__()
        self.encoder_reshape = Im2Seq(in_channels)

        self.encoder = EncoderWithRNN(
            self.encoder_reshape.out_channels, hidden_size)
        self.out_channels = self.encoder.out_channels

    def forward(self, x):
        x = self.encoder_reshape(x)
        x = self.encoder(x)
        return x


neck = SequenceEncoder(in_channels=288)
sequence = neck(feature)
print("Sequence shape:", sequence.shape)
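The reshaping done by Im2Seq is easy to follow on a tiny example. The sketch below is a plain-Python illustration on nested lists rather than tensors; im2seq is a hypothetical stand-in that drops the height axis (which must be 1) and swaps channels and width.

```python
def im2seq(feature):
    """Turn a (B, C, 1, W) nested list into a (B, W, C) nested list."""
    sequences = []
    for sample in feature:
        for channel in sample:
            assert len(channel) == 1, "feature height must be 1"
        squeezed = [channel[0] for channel in sample]         # C x W
        transposed = [list(step) for step in zip(*squeezed)]  # W x C
        sequences.append(transposed)
    return sequences


# One sample with 2 channels, height 1, and width 3
feature = [[[[1, 2, 3]], [[4, 5, 6]]]]
print(im2seq(feature))  # [[[1, 4], [2, 5], [3, 6]]]
```

Each of the W positions becomes one time step with a C-dimensional feature vector. The bidirectional LSTM then maps every step to 2 * hidden_size values, that is 96 for hidden_size=48.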


• Head

The prediction head consists of a fully connected layer and softmax, which compute the label probability distribution at each time step of the sequence feature. In this example the model only recognizes 36 classes: lowercase English letters and digits (26+10) (source code location):


class CTCHead(nn.Layer):
    def __init__(self,
                 in_channels,
                 out_channels,
                 **kwargs):
        """
        CTC prediction layer
        :param in_channels: number of input channels
        :param out_channels: number of output channels
        """
        super(CTCHead, self).__init__()
        self.fc = nn.Linear(
            in_channels,
            out_channels)

        # Thinking: how large should out_channels be?
        self.out_channels = out_channels

    def forward(self, x):
        predicts = self.fc(x)
        result = predicts

        if not self.training:
            predicts = F.softmax(predicts, axis=2)
            result = predicts

        return result


When the network is randomly initialized, its output is meaningless. After softmax, the prediction with the highest probability at each time step can be obtained, where pred_id is the predicted label ID and pred_scores is the confidence of the prediction:


ctc_head = CTCHead(in_channels=96, out_channels=37)
predict = ctc_head(sequence)
print("predict shape:", predict.shape)
result = F.softmax(predict, axis=2)
pred_id = paddle.argmax(result, axis=2)
pred_scores = paddle.max(result, axis=2)
print("pred_id:", pred_id)
print("pred_scores:", pred_scores)
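What softmax followed by argmax does at each time step can be sketched in plain Python. This is an illustration only: softmax and greedy_step are hypothetical helpers, and the logits are made-up numbers for three time steps over four classes.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one time step."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def greedy_step(logits):
    """Return (class index, probability) of the most likely class."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs[best]


# Three time steps over four classes (class 0 plays the role of the CTC blank)
sequence_logits = [
    [2.0, 0.1, 0.1, 0.1],
    [0.0, 3.0, 0.5, 0.2],
    [0.1, 0.2, 0.3, 4.0],
]
pred_id = [greedy_step(step)[0] for step in sequence_logits]
print(pred_id)  # [0, 1, 3]
```

The class index plays the role of pred_id and the returned probability the role of the score. In the model, the 37 output classes are the 36 characters plus the blank.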


• Post-processing

The recognition network returns, for each time step, the index with the maximum probability, whereas the expected final output is the corresponding text. The post-processing of CRNN is therefore decoding. The main logic is as follows:


def decode(text_index, text_prob=None, is_remove_duplicate=False):
    """ convert text-index into text-label. """
    character = "-0123456789abcdefghijklmnopqrstuvwxyz"
    result_list = []
    # Ignored tokens: [0] represents the blank in CTC
    ignored_tokens = [0]
    batch_size = len(text_index)
    for batch_idx in range(batch_size):
        char_list = []
        conf_list = []
        for idx in range(len(text_index[batch_idx])):
            if text_index[batch_idx][idx] in ignored_tokens:
                continue
            # Merge the same characters between blanks
            if is_remove_duplicate:
                # only for predict
                if idx > 0 and text_index[batch_idx][idx - 1] == text_index[
                        batch_idx][idx]:
                    continue
            # Store the decoded result in char_list
            char_list.append(character[int(text_index[batch_idx][
                idx])])
            # Record confidence
            if text_prob is not None:
                conf_list.append(text_prob[batch_idx][idx])
            else:
                conf_list.append(1)
        text = ''.join(char_list)
        # Output the result
        result_list.append((text, np.mean(conf_list)))
    return result_list


Taking the prediction produced by the randomly initialized head as an example, decoding gives:

pred_id = paddle.argmax(result, axis=2)
pred_scores = paddle.max(result, axis=2)
print(pred_id)
decode_out = decode(pred_id, pred_scores)
print("decode out:", decode_out)


Quick test: if the input indices come from a trained model, is the decoding result correct?

# Replace the predicted result of the model
right_pred_id = paddle.to_tensor([['xxxxxxxxxxxxx!']])
tmp_scores = paddle.ones(shape=right_pred_id.shape)
out = decode(right_pred_id, tmp_scores)
print("out:", out)
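The decoding rule can be tried on a hand-written index sequence. The sketch below is a simplified, single-sequence illustration of greedy CTC decoding with duplicate removal: ctc_greedy_decode is a hypothetical helper, and the index sequence is made up for this example.

```python
CHARACTER = "-0123456789abcdefghijklmnopqrstuvwxyz"  # index 0 is the blank


def ctc_greedy_decode(indices, blank=0):
    """Drop blanks and merge repeated indices that are not separated by a blank."""
    chars = []
    previous = None
    for idx in indices:
        if idx != blank and idx != previous:
            chars.append(CHARACTER[idx])
        previous = idx
    return "".join(chars)


# Per-step predictions: h h - e - l l - l o -
indices = [18, 18, 0, 15, 0, 22, 22, 0, 22, 25, 0]
print(ctc_greedy_decode(indices))  # hello
```

The blank between the two runs of "l" is what keeps the double letter: without it, the repeated index would be merged into a single character.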


The above steps have built the network and also realized a simple forward prediction. 


The untrained network cannot predict the result correctly. Therefore, it is necessary to define the loss function and 
optimization strategy to run the entire network. The network training principle will be described in detail below. 


5.2.3 Training the CRNN Text Recognition Model 


Preparing the Training Data

PaddleOCR supports two data formats:

- lmdb: for training on datasets stored in lmdb format (LMDBDataSet);
- General data: for training on datasets stored as text files (SimpleDataSet).

Only the reading of the general data format is introduced here.

The default storage path of the training data is ./train_data. Execute the following command to decompress the data:


!cd /home/aistudio/work/train_data/ && tar xf ic15_data.tar


After decompression, the training images are in one folder, and a txt file (rec_gt_train.txt) records the image paths and labels. The contents of the txt file are as follows:

" Image file name             Image annotation information "

train/word_1.png	Genaxis Theatre
train/word_2.png	[06]

Note: In the txt file, the image path and the label are separated by \t by default. Any other separator will cause training errors.
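Why the separator matters can be seen from a small parser. The following is a minimal sketch, assuming one "path<TAB>label" pair per line; parse_label_line is a hypothetical helper, not the PaddleOCR implementation.

```python
def parse_label_line(line, delimiter="\t"):
    """Split one line of the label file into (image path, label)."""
    parts = line.rstrip("\n").split(delimiter)
    if len(parts) != 2:
        raise ValueError("expected '<path>\\t<label>', got: {!r}".format(line))
    return parts[0], parts[1]


print(parse_label_line("train/word_1.png\tGenaxis Theatre\n"))
# ('train/word_1.png', 'Genaxis Theatre')
```

Because labels such as "Genaxis Theatre" contain spaces themselves, splitting on a space would cut the label apart, which is why a tab is used.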


The data set should have the following file structure: 


|-train_data
  |-ic15_data
    |- rec_gt_train.txt
    |- train
        |- word_001.png
        |- word_002.jpg
        |- word_003.jpg
    |- rec_gt_test.txt
    |- test
        |- word_001.png
        |- word_002.jpg
        |- word_003.jpg


Confirm that the data paths in the configuration file are correct. Take rec_icdar15_train.yml as an example:


Train:
  dataset:
    name: SimpleDataSet
    # Root directory of the training data
    data_dir: ./train_data/ic15_data/
    # Training data labels
    label_file_list: ["./train_data/ic15_data/rec_gt_train.txt"]
    transforms:
      - DecodeImage: # load image
          img_mode: BGR
          channel_first: False
      - CTCLabelEncode: # Class handling label
      - RecResizeImg:
          image_shape: [3, 32, 100] # [3,32,320]
      - KeepKeys:
          keep_keys: ['image', 'label', 'length'] # dataloader will return list in this order
  loader:
    shuffle: True
    batch_size_per_card: 256
    drop_last: True
    num_workers: 8
    use_shared_memory: False

Eval:
  dataset:
    name: SimpleDataSet
    # Root directory of the evaluation data
    data_dir: ./train_data/ic15_data
    # Evaluation data labels
    label_file_list: ["./train_data/ic15_data/rec_gt_test.txt"]
    transforms:
      - DecodeImage: # load image
          img_mode: BGR
          channel_first: False
      - CTCLabelEncode: # Class handling label
      - RecResizeImg:
          image_shape: [3, 32, 100]
      - KeepKeys:
          keep_keys: ['image', 'label', 'length'] # dataloader will return list in this order
  loader:
    shuffle: False
    drop_last: False
    batch_size_per_card: 256
    num_workers: 4
    use_shared_memory: False


Data Preprocessing 


The training data sent to the network must have consistent dimensions within a batch. In addition, to keep the features of different dimensions numerically comparable, the data needs to be uniformly scaled and normalized.

To make the model more robust, suppress overfitting, and improve generalization, some data augmentation is also needed.


• Scaling and normalization

This was introduced in the second section. It is the last step before images are sent to the network: call resize_norm_img to scale, pad, and normalize the image.

• Data augmentation

PaddleOCR implements a variety of data augmentation methods, such as color inversion, random cropping, affine transformation, and random noise. A simple random crop is shown here as an example; for more augmentation methods, refer to rec_img_aug.py.


def get_crop(image):
    """
    random crop
    """
    import random
    h, w, _ = image.shape
    top_min = 1
    top_max = 8
    top_crop = int(random.randint(top_min, top_max))
    top_crop = min(top_crop, h - 1)
    crop_img = image.copy()
    ratio = random.randint(0, 1)
    if ratio:
        # Cut top_crop rows from the top
        crop_img = crop_img[top_crop:h, :, :]
    else:
        # Cut top_crop rows from the bottom
        crop_img = crop_img[0:h - top_crop, :, :]
    return crop_img


# Read the picture
raw_img = cv2.imread("/home/aistudio/work/word_1.png")
plt.figure()
plt.subplot(2, 1, 1)
# Visualize the original image
plt.imshow(raw_img)
# Cut randomly
crop_img = get_crop(raw_img)
plt.subplot(2, 1, 2)
# Visualize the augmented image
plt.imshow(crop_img)
plt.show()


Training of the Main Program 


The entry point of model training is train.py, which shows the modules required for training: build dataloader, build post process, build model, build loss, build optim, and build metric. Once all the parts are connected, training can start:


• Build dataloader

Training requires the data to be grouped into batches of a specified size, which are yielded sequentially during training. This example uses the SimpleDataSet implemented in PaddleOCR.

After slightly modifying the original code, the main logic for returning one piece of data is as follows:


def __getitem__(data_line, data_dir):
    import os
    mode = "train"
    delimiter = '\t'
    try:
        substr = data_line.strip("\n").split(delimiter)
        file_name = substr[0]
        label = substr[1]
        img_path = os.path.join(data_dir, file_name)
        data = {'img_path': img_path, 'label': label}
        if not os.path.exists(img_path):
            raise Exception("{} does not exist!".format(img_path))
        with open(data['img_path'], 'rb') as f:
            img = f.read()
            data['image'] = img
        # Pre-processing operation, comment out first
        # outs = transform(data, self.ops)
        outs = data
    except Exception as e:
        print("When parsing line {}, error happened with msg: {}".format(
            data_line, e))
        outs = None
    return outs


Suppose the current input line is train/word_1.png and Genaxis Theatre, separated by a tab, and the training data path is /home/aistudio/work/train_data/ic15_data/. The parsed result is then a dictionary containing three fields: img_path, label, and image.

data_line = "train/word_1.png\tGenaxis Theatre"
data_dir = "/home/aistudio/work/train_data/ic15_data/"
item = __getitem__(data_line, data_dir)
print(item)


After a single piece of data is returned, paddle.io.DataLoader is called to merge the data into batches. For details, please refer to build_dataloader.
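The effect of batch_size and drop_last on the number of batches can be sketched in plain Python. This is an illustration of the batching idea only, not the paddle.io.DataLoader implementation; iter_batches is a hypothetical helper.

```python
def iter_batches(samples, batch_size, drop_last=False):
    """Yield consecutive batches; optionally drop an incomplete final batch."""
    batch = []
    for sample in samples:
        batch.append(sample)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch and not drop_last:
        yield batch


samples = list(range(10))
print(list(iter_batches(samples, 4)))
# [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
print(list(iter_batches(samples, 4, drop_last=True)))
# [[0, 1, 2, 3], [4, 5, 6, 7]]
```

With drop_last enabled, a dataset smaller than the batch size yields no batch at all, so the number of training images should be at least the batch size.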


• Build model

Build model constructs the main network structure. The details are described in “2.3 Code Implementation” and are not repeated here. For the code of each module, please refer to modeling.


• Build loss

The loss function of the CRNN model is CTC loss. PaddlePaddle already provides the commonly used loss functions, so it only needs to be called:


import paddle
import paddle.nn as nn


class CTCLoss(nn.Layer):
    def __init__(self, use_focal_loss=False, **kwargs):
        super(CTCLoss, self).__init__()
        # Blank is a meaningless connector for ctc
        self.loss_func = nn.CTCLoss(blank=0, reduction='none')

    def forward(self, predicts, batch):
        if isinstance(predicts, (list, tuple)):
            predicts = predicts[-1]
        # Transpose the prediction results of the head layer so that
        # the time-step dimension comes first
        predicts = predicts.transpose((1, 0, 2))  # [80, 1, 37]
        N, B, _ = predicts.shape
        preds_lengths = paddle.to_tensor([N] * B, dtype='int64')
        labels = batch[1].astype("int32")
        label_lengths = batch[2].astype('int64')
        # Calculate the loss function
        loss = self.loss_func(predicts, labels, preds_lengths, label_lengths)
        loss = loss.mean()
        return {'loss': loss}
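To make the quantity behind CTC loss concrete, the sketch below computes the negative log-likelihood of a label with the standard CTC forward recursion in plain Python. It is an illustration under stated assumptions: the function name is hypothetical, the input is already a matrix of per-step probabilities (the PaddlePaddle API works on the network outputs instead), and no attention is paid to numerical stability.

```python
import math


def ctc_neg_log_likelihood(probs, label, blank=0):
    """probs[t][k] is the probability of class k at time step t."""
    ext = [blank]
    for c in label:
        ext += [c, blank]  # blank-separated label: -, l1, -, l2, -, ...
    T, S = len(probs), len(ext)
    alpha = [0.0] * S
    alpha[0] = probs[0][ext[0]]
    if S > 1:
        alpha[1] = probs[0][ext[1]]
    for t in range(1, T):
        new = [0.0] * S
        for s in range(S):
            total = alpha[s]
            if s >= 1:
                total += alpha[s - 1]
            # a blank may be skipped only between two different characters
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                total += alpha[s - 2]
            new[s] = total * probs[t][ext[s]]
        alpha = new
    likelihood = alpha[S - 1] + (alpha[S - 2] if S > 1 else 0.0)
    return -math.log(likelihood)


# Two time steps, classes {blank, 'a'}, uniform predictions, label "a".
# The valid alignments are "a-", "-a" and "aa", each with probability 0.25.
loss = ctc_neg_log_likelihood([[0.5, 0.5], [0.5, 0.5]], [1])
print(round(math.exp(-loss), 6))  # 0.75
```

The loss sums over all alignments that decode to the label, so no per-character position annotation is needed in the training data.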


• Build post process

The details are also introduced in “2.3 Code Implementation”, and the implementation logic is the same as before.

• Build optim

The optimizer is Adam, again through the PaddlePaddle API: paddle.optimizer.Adam

• Build metric

The metric computes the evaluation indicators of the model. In PaddleOCR's text recognition, a prediction counts as correct only when the whole sentence is predicted correctly. The main logic of the accuracy calculation is therefore as follows:


def metric(preds, labels):
    correct_num = 0
    all_num = 0
    for (pred), (target) in zip(preds, labels):
        pred = pred.replace(" ", "")
        target = target.replace(" ", "")
        if pred == target:
            correct_num += 1
        all_num += 1
    return {
        'acc': correct_num / all_num,
    }


preds = ["aaa", "bbb", "ccc", "123", "456"]
labels = ["aaa", "bbb", "ddd", "123", "444"]
acc = metric(preds, labels)
print("acc:", acc)
# Among the five prediction results, 3 are completely correct, so the accuracy rate
# should be 0.6


Combine the above parts and get the complete training process: 


def main(config, device, logger, vdl_writer):
    # Init dist environment
    if config['Global']['distributed']:
        dist.init_parallel_env()

    global_config = config['Global']

    # Build dataloader
    train_dataloader = build_dataloader(config, 'Train', device, logger)
    if len(train_dataloader) == 0:
        logger.error(
            "No Images in train dataset, please ensure\n" +
            "\t1. The images num in the train label_file_list should be larger than or equal with batch size.\n"
            +
            "\t2. The annotation file and path in the configuration file are provided normally."
        )
        return

    if config['Eval']:
        valid_dataloader = build_dataloader(config, 'Eval', device, logger)
    else:
        valid_dataloader = None

    # Build post process
    post_process_class = build_post_process(config['PostProcess'],
                                            global_config)

    # Build model
    # For rec algorithm
    if hasattr(post_process_class, 'character'):
        char_num = len(getattr(post_process_class, 'character'))
        if config['Architecture']["algorithm"] in ["Distillation",
                                                   ]:  # distillation model
            for key in config['Architecture']["Models"]:
                config['Architecture']["Models"][key]["Head"][
                    'out_channels'] = char_num
        else:  # base rec model
            config['Architecture']["Head"]['out_channels'] = char_num

    model = build_model(config['Architecture'])
    if config['Global']['distributed']:
        model = paddle.DataParallel(model)

    # Build loss
    loss_class = build_loss(config['Loss'])

    # Build optim
    optimizer, lr_scheduler = build_optimizer(
        config['Optimizer'],
        epochs=config['Global']['epoch_num'],
        step_each_epoch=len(train_dataloader),
        parameters=model.parameters())

    # Build metric
    eval_class = build_metric(config['Metric'])
    # Load pretrain model
    pre_best_model_dict = load_model(config, model, optimizer)
    logger.info('train dataloader has {} iters'.format(len(train_dataloader)))
    if valid_dataloader is not None:
        logger.info('valid dataloader has {} iters'.format(
            len(valid_dataloader)))

    use_amp = config["Global"].get("use_amp", False)
    if use_amp:
        AMP_RELATED_FLAGS_SETTING = {
            'FLAGS_cudnn_batchnorm_spatial_persistent': 1,
            'FLAGS_max_inplace_grad_add': 8,
        }
        paddle.fluid.set_flags(AMP_RELATED_FLAGS_SETTING)
        scale_loss = config["Global"].get("scale_loss", 1.0)
        use_dynamic_loss_scaling = config["Global"].get(
            "use_dynamic_loss_scaling", False)
        scaler = paddle.amp.GradScaler(
            init_loss_scaling=scale_loss,
            use_dynamic_loss_scaling=use_dynamic_loss_scaling)
    else:
        scaler = None

    # Start training
    program.train(config, train_dataloader, valid_dataloader, device, model,
                  loss_class, optimizer, lr_scheduler, post_process_class,
                  eval_class, pre_best_model_dict, logger, vdl_writer, scaler)




Starting Training 


As with the detection task, the PaddleOCR recognition task passes its parameters through configuration files.

To perform a complete model training, first download the entire project and install the related dependencies:


# Clone PaddleOCR code
#!git clone https://gitee.com/paddlepaddle/PaddleOCR
# Change the working directory to /home/aistudio/PaddleOCR
import os
os.chdir("/home/aistudio/PaddleOCR")
# Install PaddleOCR third-party dependencies
!pip install -r requirements.txt


Create a soft link and place the training data under the PaddleOCR project: 


!ln -s /home/aistudio/work/train_data/ /home/aistudio/PaddleOCR/


Download the pre-trained model: 


To speed up convergence, it is recommended to download a pre-trained model and fine-tune it on the icdar2015 dataset.


!cd PaddleOCR/
# Download the pre-trained model of MobileNetV3
!wget -nc -P ./pretrain_models/ https://paddleocr.bj.bcebos.com/dygraph_v2.0/en/rec_mv3_none_bilstm_ctc_v2.0_train.tar
# Decompress model parameters
!tar -xf pretrain_models/rec_mv3_none_bilstm_ctc_v2.0_train.tar && rm -rf pretrain_models/rec_mv3_none_bilstm_ctc_v2.0_train.tar


Starting the training is simple: just specify the configuration file. In addition, -o can be used on the command line to override parameters in the configuration file. The parameters used in the training command are:

• Global.pretrained_model: path of the pre-trained model to load
• Global.character_dict_path: dictionary path (only 26 lowercase letters + numbers are supported here)
• Global.eval_batch_step: evaluation frequency
• Global.epoch_num: total number of training epochs


!python3 tools/train.py -c configs/rec/rec_icdar15_train.yml \
    -o Global.pretrained_model=rec_mv3_none_bilstm_ctc_v2.0_train/best_accuracy \
    Global.character_dict_path=ppocr/utils/ic15_dict.txt \
    Global.eval_batch_step=[0,200] \
    Global.epoch_num=40
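How dotted key=value overrides can be merged into a nested configuration is sketched below. This is a simplified illustration and not PaddleOCR's own option parser: apply_overrides is a hypothetical helper, and the starting values in config are made up for the example.

```python
import ast


def apply_overrides(config, overrides):
    """Apply 'A.b.c=value' strings to a nested dict in place."""
    for item in overrides:
        dotted_key, raw_value = item.split("=", 1)
        try:
            value = ast.literal_eval(raw_value)  # numbers, lists, ...
        except (ValueError, SyntaxError):
            value = raw_value  # keep plain strings such as paths as-is
        node = config
        keys = dotted_key.split(".")
        for key in keys[:-1]:
            node = node.setdefault(key, {})
        node[keys[-1]] = value
    return config


# Illustrative starting values, not the real defaults of the configuration file
config = {"Global": {"epoch_num": 72, "eval_batch_step": [0, 2000]}}
apply_overrides(config, ["Global.eval_batch_step=[0,200]", "Global.epoch_num=40"])
print(config)
# {'Global': {'epoch_num': 40, 'eval_batch_step': [0, 200]}}
```

Overrides of this kind change only the named keys and leave the rest of the configuration file untouched, which is convenient for quick experiments.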


According to the save_model_dir field set in the configuration file, the following files will be saved:

output/rec/ic15
├── best_accuracy.pdopt
├── best_accuracy.pdparams
├── best_accuracy.states
├── config.yml
├── iter_epoch_3.pdopt
├── iter_epoch_3.pdparams
├── iter_epoch_3.states
├── latest.pdopt
├── latest.pdparams
├── latest.states
└── train.log

Here, best_accuracy.* is the best model on the evaluation set; iter_epoch_x.* is the model saved at intervals of save_epoch_step; latest.* is the model from the last epoch.


Summary:

Training on your own data requires setting:
1. Training and evaluation data paths (required)
2. Dictionary path (required)
3. Pre-trained model (optional)
4. Learning rate, image shape, and network structure (optional)


Model Evaluation 
To change the evaluation dataset, modify the label_file_list setting under Eval in configs/rec/rec_icdar15_train.yml.

By default, the icdar2015 evaluation set is used here, and the weights of the newly trained model are loaded:

!python tools/eval.py -c configs/rec/rec_icdar15_train.yml -o Global.checkpoints=output/rec/ic15/best_accuracy \
    Global.character_dict_path=ppocr/utils/ic15_dict.txt


After the evaluation, you can see the accuracy of the trained model on the validation set.

PaddleOCR supports evaluating while training. The evaluation frequency can be changed through eval_batch_step in configs/rec/rec_icdar15_train.yml; the default is once every 2000 iterations. During evaluation, the model with the best accuracy is saved as output/rec/ic15/best_accuracy by default.

If the validation set is large, evaluation is time-consuming. In that case, it is recommended to evaluate less often or to evaluate only after training.


Prediction 


A model trained with PaddleOCR can be used for quick prediction with the following script.

Prediction picture:

The path of the default prediction image is set in infer_img, and the trained parameter file is loaded through -o Global.checkpoints:

!python tools/infer_rec.py -c configs/rec/rec_icdar15_train.yml -o Global.checkpoints=output/rec/ic15/best_accuracy Global.character_dict_path=ppocr/utils/ic15_dict.txt


Get the prediction result of the input image: 


infer_img: doc/imgs_words_en/word_19.png 
result: slow 0.8795223 


5.2.4 Text Recognition FAQ 
Universal Questions 


Q1.1: What are the text recognition algorithms provided by PaddleOCR? 


A: PaddleOCR provides five major text recognition algorithms: CRNN, StarNet, RARE, Rosetta, and SRN. Among them, CRNN, StarNet, and Rosetta are based on CTC, RARE is based on attention, and SRN is an algorithm developed by Baidu that introduces semantic information and thereby improves accuracy. For details, please refer to: text recognition algorithms.


Q1.2: What are the key CRNN technologies in text recognition? 


A: There are three: (1) CNN image feature extraction; (2) a deep bidirectional LSTM network that further extracts the sequence features of the text; (3) connectionist temporal classification (CTC), which solves the character alignment problem.


Q1.3: Which is better for recognizing Chinese text lines, CTC or Attention?

A: (1) In terms of accuracy, CTC performs better than Attention in common OCR scenarios. The dictionary to be recognized is large, with more than 3,000 commonly used Chinese characters, so when training samples are insufficient it is difficult for the model to learn the sequence relationships between characters. Under these circumstances, the Attention model has no advantage. In addition, the Attention model is better suited to short sentences than to long ones.

(2) In terms of training and inference speed, the serial decoder of Attention limits its speed, whereas the CTC structure is more efficient and achieves faster inference.

Q1.4: How to recognize curved text? What are the application scenarios of TPS? Does TPS work well?

A: (1) In most cases, if the text is not severely curved, detect its four vertices and use an affine transformation to rectify the image before recognition.

(2) If this does not meet your requirements, you can try TPS (Thin Plate Spline). TPS is an interpolation method that warps an image using several control points and is commonly used for image morphing. It is often adopted for recognizing curved text: when the detected text is irregular or curved (for example, with segmentation-based methods), TPS is the first choice for rectifying the text region into a rectangle before recognition. The TPS module is used in the STAR-Net and RARE algorithms.

Warning: Although TPS works well in theory, it lacks robustness in practice and makes recognition more time-consuming. Therefore, think twice before using TPS.


The Application of Text Recognition in Vertical Scenarios 


Q2.1: How to recognize text blurred by the background (such as a signature with a seal stamped over it, or the text of the seal)?

A: (1) If the text is recognizable with the naked eye, first make sure that the detection box is correct. If it is not, consider preprocessing the images with color filtering and adding more relevant training data. For recognition, add augmented images in which the text is blurred by the background to the training data.

(2) If MobileNet cannot meet the demand, you can try larger models such as ResNet.


Q2.2: Are there any image enhancement methods to handle slightly blurred text?


A: If the text can be recognized with the naked eye, consider adding blur data augmentation such as the mean filter,
the median filter, the Gaussian filter, or other blurring operators from image processing. You can also try to
strengthen the robustness of the model through perturbation-based data augmentation. Adversarial training and super
resolution (SR) are also feasible. However, there is no single best solution that is widely accepted by practitioners,
so it is recommended to set some constraints during data collection to improve image quality.


Q2.3: Are there any SR solutions for recognizing low-resolution text or text with a small font size?


A: Super-resolution methods consist of traditional and deep-learning-based methods. Among DL-based ones, SRCNN is
a classic solution, and there is a related CVPR 2020 paper: Unpaired Image Super-Resolution using Pseudo-Supervision.
However, that paper has not been sufficiently validated in practice.


Q2.4: How to recognize long texts?


A: During the training of Chinese recognition models, instead of directly resizing images to [3, 32, 320], you should
scale them proportionally to a height of 32. If the resulting width is less than 320, pad it with 0. If the
width-to-height ratio of an image is greater than 10, discard the image. During inference on a single image, scale the
image as described above without limiting the width; during inference on several images, process them as a batch and
use the maximum width in each batch as the batch width.
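
The resizing rule above can be sketched as follows. This is a minimal illustration, assuming HWC NumPy images and a
simple nearest-neighbour resize; the function names (resize_keep_ratio, pad_to_width, prepare_training_sample,
prepare_inference_batch) are hypothetical and are not PaddleOCR APIs.

```python
import math
import numpy as np

def resize_keep_ratio(img, target_h=32, max_w=None):
    """Nearest-neighbour resize of an HWC image to height target_h, keeping the aspect ratio."""
    h, w = img.shape[:2]
    new_w = max(1, math.ceil(target_h * w / h))
    if max_w is not None:
        new_w = min(new_w, max_w)
    rows = np.minimum(np.arange(target_h) * h // target_h, h - 1)
    cols = np.minimum(np.arange(new_w) * w // new_w, w - 1)
    return img[rows][:, cols]

def pad_to_width(img, width):
    """Right-pad an HWC image with zeros and return it in CHW layout."""
    h, w, c = img.shape
    padded = np.zeros((h, width, c), dtype=img.dtype)
    padded[:, :w] = img
    return padded.transpose(2, 0, 1)

def prepare_training_sample(img, target_h=32, target_w=320, max_ratio=10):
    """Return a [3, 32, 320] array, or None when the image is too long and should be discarded."""
    h, w = img.shape[:2]
    if w / h > max_ratio:
        return None
    resized = resize_keep_ratio(img, target_h, max_w=target_w)
    return pad_to_width(resized, target_w)

def prepare_inference_batch(imgs, target_h=32):
    """Resize every image to height 32 and pad all of them to the widest image in the batch."""
    resized = [resize_keep_ratio(im, target_h) for im in imgs]
    batch_w = max(r.shape[1] for r in resized)
    return np.stack([pad_to_width(r, batch_w) for r in resized])
```

For example, a 48 x 240 image becomes 32 x 160 after scaling and is then zero-padded on the right to a width of 320,
while a 20 x 300 image (ratio 15) is dropped from training.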


Q2.5: How to recognize English text lines with spaces?


A: There are two solutions to space recognition:


(1) Optimize the text detection algorithm so that the text is broken at spaces. With this solution, a text line
containing spaces is divided into several parts when the detection data is annotated.


(2) Optimize the text recognition algorithm. Add the space character to the recognition dictionary, and annotate the
spaces in the text lines of the training data. In addition, during data synthesis, produce texts with spaces by
joining training samples.


Q2.6: Can curved texts be rectified with the TPS of OpenCV?


A: To use the TPS of OpenCV, the points on the upper and lower boundaries of the text must be labelled. These points
can hardly be obtained by traditional or DL-based methods. In PaddleOCR, the TPS module of StarNet can learn the points
and rectify texts automatically. You can give it a try.


Q2.7: How to recognize the word art on signboards or advertisements?


A: This is very challenging because the style of word art is very different from that of printed letters. If all the
word art comes from the same dictionary list, you can turn each dictionary entry into an image template to be
recognized, and then use a general image retrieval and recognition system. You can try the image recognition system of
PaddleClas.


Q2.8: How to recognize the seal text? 


A: (1) Use a recognition network with TPS, or ABCNet; (2) flatten the image through a polar coordinate
transformation, and then use CRNN.
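
The polar flattening in option (2) can be sketched as below. This is only an illustration under stated assumptions:
the seal is roughly circular, its center and the inner and outer radii of the text ring are already known, and
nearest-neighbour sampling is good enough. The function name unwrap_ring and its parameters are hypothetical.

```python
import numpy as np

def unwrap_ring(img, center, r_inner, r_outer, out_w=720):
    """Sample the annulus between r_inner and r_outer into a flat strip.

    The strip has r_outer - r_inner rows (outer edge on top) and out_w columns,
    one column per sampled angle.
    """
    cx, cy = center
    h, w = img.shape[:2]
    radii = np.arange(r_outer, r_inner, -1)
    thetas = np.linspace(0.0, 2.0 * np.pi, out_w, endpoint=False)
    rr, tt = np.meshgrid(radii, thetas, indexing="ij")
    # Map each (radius, angle) pair back to the nearest source pixel.
    xs = np.clip(np.rint(cx + rr * np.cos(tt)).astype(int), 0, w - 1)
    ys = np.clip(np.rint(cy + rr * np.sin(tt)).astype(int), 0, h - 1)
    return img[ys, xs]
```

Depending on how the text runs around the seal, the resulting strip may still need to be flipped or rotated by 180
degrees before it is passed to the recognizer.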


Q2.9: How to cope with poor performance of recognizing specific characters when using pretrained
models for inference?


A: Since the provided recognition models are trained with universal large datasets, some characters might be rarely 
contained in the training set. You can build datasets for specific scenarios and fine-tune them based on the provided 
models. It is suggested the number of each character is no less than 300, and the number of different characters is 
balanced. For details, please refer to: fine-tuning. 


Q2.10: How to handle repeated characters in the inference results of trained recognition models?


A: Check whether the input dimension used in training is the same as that used in inference. If the training dimension
is [3, 32, 320] and the inference dimension is [3, 64, 640], many repeated characters will be recognized.


Q2.11: How to recognize ancient Chinese characters on bamboo slips?


A: When they are common Chinese characters, just annotate an adequate amount of data and fine-tune the model. If there
is not enough data, you can try StyleText. But when they are special characters such as oracle bone inscriptions or
hieroglyphs, it is necessary to build a specialized dictionary and train a model.


Q2.12: How to recognize texts in videos with PaddleOCR? 


A: Now PaddleOCR mainly deals with images. If you want to recognize texts in videos, you can first extract frames from 
videos and then use PaddleOCR for text recognition. 


Q2.13: How to fine-tune the model when encountering characters not covered by the Chinese-English recognition model?


A: You may need to update the recognition dictionary and fine-tune the model. For example, if you want the recognition
model to cope with Roman numerals, follow these steps to fine-tune the model:


1. Prepare Chinese and English recognition data and Roman numeral data for training, and ensure their quality;
2. Modify the default dictionary file by adding the Roman numeral characters at the end;


3. Download the pretrained model offered by PaddleOCR, configure the model and the data path, and then start the
training.
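
Step 2 can be sketched as follows. This is a minimal illustration that treats the dictionary as an in-memory list with
one character per entry; the helper name extend_dictionary and the sample characters are hypothetical. Appending at the
end keeps the index of every existing character unchanged.

```python
def extend_dictionary(chars, extra_chars):
    """Return the dictionary with any missing characters appended at the end, in order."""
    seen = set(chars)
    extended = list(chars)
    for ch in extra_chars:
        if ch not in seen:
            extended.append(ch)
            seen.add(ch)
    return extended

# Unicode Roman numeral characters one to ten.
ROMAN_NUMERALS = ["Ⅰ", "Ⅱ", "Ⅲ", "Ⅳ", "Ⅴ", "Ⅵ", "Ⅶ", "Ⅷ", "Ⅸ", "Ⅹ"]

# A toy dictionary standing in for the default dictionary file.
base_dict = ["a", "b", "中", "Ⅰ"]
new_dict = extend_dictionary(base_dict, ROMAN_NUMERALS)
```

The extended list can then be written back to the dictionary file with one character per line, using UTF-8 encoding.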


Q2.14: How to cope with poor recognition results for special characters such as punctuation marks?


A: First, confirm whether the special characters are contained in the dictionary. If the characters are in the
dictionary but the recognition results are still poor, add more relevant data and fine-tune the model, since the poor
performance might be caused by a shortage of training data for these characters.


Q2.15: How to handle an image with different kinds of text (for example, both printed and handwritten text)?


A: This is very common. Take an exam paper as an example: it has both printed and handwritten text. In this case, you
can adopt the solution of "1 detection model + 1 N-class model + N recognition models". All types of text share one
detection model, and the N-class model, an additionally trained classifier, classifies the detected text. If there are
printed and handwritten texts, it is a two-class model; if there are N types of text, it is an N-class model. For
recognition, one recognition model is trained for each text type. If an image contains printed and handwritten text,
two recognition models are needed, one for printed text and the other for handwritten text. When a text box
is classified as handwritten text, it is recognized by the specialized handwriting model. The same goes for other cases.
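
The routing in the "1 detection model + 1 N-class model + N recognition models" scheme can be sketched as below. The
classifier and recognizers are toy stand-ins working on strings, and every name here (recognize_mixed, toy_classifier,
toy_recognizers) is hypothetical; in practice each would wrap a trained model.

```python
def recognize_mixed(crops, classify, recognizers):
    """Send each detected text crop to the recognizer that matches its predicted type."""
    results = []
    for crop in crops:
        text_type = classify(crop)            # the N-class model
        text = recognizers[text_type](crop)   # one recognizer per text type
        results.append((text_type, text))
    return results

# Toy stand-ins: crops are strings, and the "hw:" prefix marks handwriting.
def toy_classifier(crop):
    return "handwritten" if crop.startswith("hw:") else "printed"

toy_recognizers = {
    "printed": lambda crop: crop.upper(),
    "handwritten": lambda crop: crop[3:],
}

demo = recognize_mixed(["abc", "hw:xyz"], toy_classifier, toy_recognizers)
```

Adding a new text type then only requires one more class in the classifier and one more entry in the recognizer table.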


Q2.16: Is there anything special about a dictionary containing different types of text? How much
loss of accuracy will the diversity cause?


A: A large mixed dictionary may lead to an oversized FC layer at the end of the network and a larger model. If you have
special requirements, you can merge the dictionaries of the text types you need into one dictionary and use it to train
the model. But if you introduce too many similar-looking characters, there may be a loss of accuracy, and the issue of
character balance also needs to be taken into account. For the time being, you can keep the dictionaries separate in
PaddleOCR.


Q2.17: In some less common languages like Thai, one word may take up two to three characters. How should the
dictionary be made in such cases?


A: Treat the characters of the same word as a whole. Make sure there is only one word in each line of the dictionary.


Training Process and Model Optimization 


Q3.1: Increasing batch_size cannot largely improve the model training speed. 


A: In this case, you can consider increasing the initial memory by setting the following environment variable before
running the code:
export FLAGS_initial_cpu_memory_in_mb=2000 # Set the initial memory to about 2G.


Q3.2: What to do if there are warnings about excessive GPU memory or host memory usage, or about memory leaks?


A: You can mitigate the leaks by following PR #2230.


Q3.3: In recognition training, the precision on the training set reaches 90, but the precision
on the validation set is still 70. How to handle this case?


A: This may be due to overfitting. There are two solutions you can try: (1) introduce more augmentation methods or
increase the probability of augmentation (the default value is 0.4); (2) increase the L2 decay value of the system.


5.2.5 Assignment 


[Task 1] 


Visualize the results of the data augmentation methods implemented in PaddleOCR (noise and jitter), and explain their
effects in words.


Optional test picture: 


[Task 2] 


In PaddleOCR, replace the backbone in the configuration file configs/rec/rec_icdar15_train.yml with ResNet34_vd. When
the input image shape is (3, 32, 100), what is the final output feature size of the head layer?


[Task 3] 


Download the 100,000-sample Chinese dataset rec_data_lesson_demo, modify the configs/rec/rec_icdar15_train.yml
configuration file to train a recognition model, and provide the training log.


Loadable pre-training model: https://paddleocr.bj.bcebos.com/dygraph_v2.0/en/rec_mv3_none_bilstm_ctc_v2.0_train.tar


5.3 Summary 


This chapter introduces the theory and practice of text recognition algorithms. 


The first section introduced the theoretical knowledge and mainstream algorithms related to text recognition, including
the CTC-based methods, the Sequence2Sequence-based methods, and the segmentation-based methods. The ideas and
contributions of classic papers were presented for each.


The next section is the practical course based on the CRNN algorithm, in which the whole training process will be 
illustrated from networking to optimization. For more functions and codes, please refer to PaddleOCR. 


CHAPTER
SIX


PP-OCR SYSTEM AND STRATEGY 


6.1 Introduction of PP-OCR 


The first two chapters talk about the DBNet text detection algorithm and the CRNN text recognition algorithm. However,
in real scenarios, it is impossible to obtain both the text position and the text content of an image with only a text
detection model or only a text recognition model. Therefore, we chain the text detection algorithm and the text
recognition algorithm to build the PP-OCR text detection and recognition system. In actual use, the direction of the
detected text may not be what we expect, which leads to recognition errors. Therefore, a direction classifier has also
been introduced into the PP-OCR system.


This chapter mainly introduces the PP-OCR text detection and recognition system and the involved optimization strategies. 
In this section, you can learn: 


• PaddleOCR strategy refining skills
• Optimization techniques and methods for text detection, recognition, and direction classifier models


The PP-OCR system has undergone two phases of optimization. The following part will briefly introduce the system and
its two optimizations.


6.1.1 Introduction to PP-OCR System and Optimization Strategies 


In PP-OCR, if you want to extract the text from an image, you need to: 


• Use a text detection method to obtain the polygon of each text area (PP-OCR uses DBNet for text detection,
which produces four-point text boxes).


• Crop the polygon area and apply perspective transformation correction to convert the text region into a rectangle,
and then use the direction classifier to correct its direction.


• Recognize the text in the rectangular box and get the recognition result.


After all these steps, the text detection and recognition for an image are finished.


The frame diagram of PP-OCR is shown below. 


Figure 1: PP-OCR system frame diagram


Text detection adopts DBNet, whose post-processing is relatively simple. Text region correction mainly uses geometric
transformation and the direction classifier. Text recognition uses CRNN, which combines convolutional and
sequence features, and applies CTC loss to resolve the length mismatch between prediction results and labels.
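
To illustrate how CTC bridges per-step predictions and a shorter label, the sketch below shows greedy CTC decoding:
take the best class at each time step, merge consecutive repeats, and drop blanks. It is a minimal illustration, not
the PaddleOCR post-processing code; placing the blank at class index 0 and the name ctc_greedy_decode are assumptions.

```python
import numpy as np

def ctc_greedy_decode(probs, charset, blank=0):
    """Decode a (T, C) score matrix; charset[i] is the character of class i + 1."""
    best = probs.argmax(axis=1)
    chars = []
    prev = blank
    for idx in best:
        # Keep a class only if it is not blank and differs from the previous step.
        if idx != blank and idx != prev:
            chars.append(charset[idx - 1])
        prev = idx
    return "".join(chars)

# Eight time steps whose best path is: a a _ a b b _ c  (underscore = blank)
path = [1, 1, 0, 1, 2, 2, 0, 3]
scores = np.eye(4)[path]
decoded = ctc_greedy_decode(scores, "abc")
```

Here the eight-step path collapses to "aabc": the blank between the two runs of "a" is what allows a genuinely repeated
character to survive the merge.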


PP-OCR uses 19 strategies, covering the backbone network, learning rate strategy, data augmentation, and model
tailoring and quantization, to optimize and slim the model, creating a PP-OCR server system and a PP-OCR mobile system.


6.1.2 Introduction to PP-OCRv2 System and Optimization Strategies 


Compared with PP-OCR, PP-OCRv2 further improves the backbone network, data augmentation, and loss function to
tackle poor on-device prediction efficiency, complex backgrounds, and misrecognition of similar characters. At the
same time, it introduces a knowledge distillation training strategy to increase model accuracy. Its improvements
include:


• Detection model optimization: (1) CML (Collaborative Mutual Learning) knowledge distillation strategy; (2)
CopyPaste data augmentation strategy;


• Recognition model optimization: (1) PP-LCNet; (2) U-DML updated knowledge distillation strategy; (3) Enhanced
CTC loss.


There are three main improvements in the results:


• In terms of accuracy, PP-OCRv2 outperforms the PP-OCR mobile version by over 7%;


• In terms of speed, PP-OCRv2 outperforms the PP-OCR server version by more than 220%;


• In terms of model size, PP-OCRv2 can be easily deployed on server or mobile platforms with a total size of
11.6M.


The comparison of the accuracy, prediction time, and model size between the PP-OCRv2 model and the previous models 
of PP-OCR series is shown below. 


Figure 2: Comparison of speed, accuracy and model size between PP-OCRv2 and PP-OCR


The frame diagram of PP-OCRv2 is shown below. 


[Figure: the framework of PP-OCRv2. Strategies in the green boxes are the same as in PP-OCR, strategies in the orange
boxes are newly added in PP-OCRv2, and strategies in the gray boxes will be adopted by PP-OCRv2-tiny in the future.]


Figure 3: PP-OCRv2 system frame diagram 


This chapter will give a detailed interpretation of optimization strategies of the PP-OCR and PP-OCRv2. 


6.2 PP-OCR Optimization Strategies 


The PP-OCR system includes a text detector, a direction classifier and a text recognizer. This section introduces the
model optimization strategies for these three components in detail.


6.2.1 Text Detection


The text detection in PP-OCR is based on the DBNet (Differentiable Binarization) model, which is segmentation-based
and simple in post-processing. The figure below shows the model structure of DBNet.


Figure 4: DBNet frame diagram


DBNet extracts features through the backbone and uses the DBFPN structure (neck) to fuse them. The fused features are
decoded by operations such as convolution (head) to generate a probability map and a threshold map, and an approximate
binary map is calculated from the two maps. During training, the loss is calculated on all three maps. Supervising the
binary map as well lets the model learn more accurate boundaries.
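
The approximate binary map can be sketched as follows. In the Differentiable Binarization formulation it is computed
per pixel as B = 1 / (1 + exp(-k (P - T))), where P is the probability map, T is the threshold map, and k is an
amplification factor. The function name approximate_binary_map is hypothetical, and the default k = 50 is taken from
the DBHead configuration shown later in this section.

```python
import numpy as np

def approximate_binary_map(prob_map, thresh_map, k=50.0):
    """Differentiable binarization: a steep sigmoid of (probability - threshold)."""
    return 1.0 / (1.0 + np.exp(-k * (prob_map - thresh_map)))

prob = np.array([0.9, 0.3, 0.1])
thresh = np.array([0.3, 0.3, 0.3])
binary = approximate_binary_map(prob, thresh)
```

Because k is large, pixels well above the threshold map to values near 1 and pixels well below it map to values near
0, yet the function stays differentiable, so the threshold map can be learned together with the probability map.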


Six optimization strategies are used in DBNet to improve model accuracy and speed, covering the backbone network,
FPN, head structure, learning rate strategy, and model tailoring. The conclusions of the ablation experiments for
different modules on the validation set are shown below.


inner_channel of the head | Remove SE | Cosine Learning Rate Decay | Learning Rate Warm-up | Precision | Recall | HMean | Model Size (M) | Inference Time (CPU, ms)
256 | - | - | - | 0.6821 | 0.5560 | 0.6127 | 7 | 406
96 | - | - | - | 0.6677 | 0.5524 | 0.6046 | 4.1 | 213
96 | ✓ | - | - | 0.6952 | 0.5413 | 0.6087 | 2.6 | 173
96 | ✓ | ✓ | - | 0.7034 | 0.5404 | 0.6112 | 2.6 | 173
96 | ✓ | ✓ | ✓ | 0.7349 | 0.5420 | 0.6239 | 2.6 | 173


Figure 5: DBNet Ablation Experiment 
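
The HMean column is the harmonic mean of precision and recall. A minimal sketch of the calculation, checked against
the last row of the table (the function name hmean is hypothetical):

```python
def hmean(precision, recall):
    """Harmonic mean (F1 score) of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Last row of the ablation table: precision 0.7349, recall 0.5420.
last_row_hmean = hmean(0.7349, 0.5420)
```

The result rounds to 0.6239, which matches the table. The harmonic mean is pulled toward the smaller of the two
values, so the gain in precision from the learning rate strategies only partly shows up in HMean while recall stays
around 0.54.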


The detailed description is given below. 


Lightweight Backbone Network 


The size of the backbone network has an important influence on the model size of the text detector. Therefore, a
lightweight backbone network should be selected to build an ultra-lightweight detection model. With the development of
image classification technology, the MobileNetV1, MobileNetV2, MobileNetV3 and ShuffleNetV2 series are often used as
lightweight backbone networks. Each series has a different model size and performance. PaddleClas provides more than
20 kinds of lightweight backbone networks. Their accuracy-speed curve on ARM is shown below.


Figure 6: The "speed-accuracy" curve of the backbone networks in PaddleClas (plot title: Performance of the mobile
models)


When the prediction time is the same, the MobileNetV3 series can achieve higher accuracy. In order to cover as many
scenarios as possible, the authors of MobileNetV3 use the parameter scale to adjust the number of feature map channels.
The standard value is 1x. A value of 0.5x means that the number of feature map channels in the network is 0.5 times
that of the corresponding 1x network. In order to balance accuracy and efficiency, when choosing the size of V3, we
adopt the structure of MobileNetV3_large 0.5x.
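
The effect of the scale parameter on channel counts can be sketched as below. The sketch assumes the commonly used
make_divisible rounding rule (round to a multiple of 8 without dropping below 90% of the requested value) and assumes
1x stage widths of 24, 40, 112 and 960 for MobileNetV3_large; both are assumptions for illustration rather than values
stated in this chapter.

```python
def make_divisible(value, divisor=8, min_value=None):
    """Round a channel count to the nearest multiple of divisor, never going below 90% of value."""
    if min_value is None:
        min_value = divisor
    new_value = max(min_value, int(value + divisor / 2) // divisor * divisor)
    if new_value < 0.9 * value:
        new_value += divisor
    return new_value

# Assumed 1x widths of the four stage outputs of MobileNetV3_large.
BASE_STAGE_CHANNELS = [24, 40, 112, 960]

scaled_channels = [make_divisible(0.5 * c) for c in BASE_STAGE_CHANNELS]
```

With scale 0.5 this gives 16, 24, 56 and 480 channels, which agrees with the in_channels passed to DBFPN later in this
section. Note that the channel counts are not exactly halved because of the rounding to multiples of 8.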


The feature map size of each stage of MobileNetV3 in DBNet is printed out below. 


import os
import sys

# Download code
os.chdir("/home/aistudio/")
!git clone https://gitee.com/paddlepaddle/PaddleOCR.git
# Switch working directory
os.chdir("/home/aistudio/PaddleOCR/")
!pip install -U pip
!pip install -r requirements.txt


# For the detailed code implementation, refer to:
# https://github.com/PaddlePaddle/PaddleOCR/blob/release/2.4/ppocr/modeling/backbones/det_mobilenet_v3.py
import numpy as np
import paddle

# Set random input
inputs = np.random.rand(1, 3, 640, 640).astype(np.float32)
x = paddle.to_tensor(inputs)

# Import MobileNetV3 library
from ppocr.modeling.backbones.det_mobilenet_v3 import MobileNetV3

# Model definition
backbone_mv3 = MobileNetV3(scale=0.5, model_name='large')

# Model forward
bk_out = backbone_mv3(x)

# Print the output shape of each stage
for i, stage_out in enumerate(bk_out):
    print("the shape of ", i, 'stage: ', stage_out.shape)


the shape of  0 stage:  [1, 16, 160, 160]
the shape of  1 stage:  [1, 24, 80, 80]
the shape of  2 stage:  [1, 56, 40, 40]
the shape of  3 stage:  [1, 480, 20, 20]


Lightweight Feature Pyramid Network DBFPN Structure 


The feature fusion (neck) part of the text detector, DBFPN, is similar in structure to the FPN used in object
detection. It fuses feature maps of different scales to improve the detection of text regions of different scales.


To facilitate the merging of feature maps with different channel numbers, a 1 x 1 convolution is used to reduce the
channel number of every feature map to the same value.


The probability map and the threshold map are generated from the fused feature map by convolution, and that
convolution also depends on inner_channels. Therefore, inner_channels has a great influence on the model size. When
inner_channels is reduced from 256 to 96, the model size is reduced from 7M to 4.1M and the speed is increased by 48%,
but the accuracy is only slightly reduced.
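
A rough count of the neck's convolution weights shows why inner_channels matters so much. The sketch assumes the DBFPN
layout printed below (one 1 x 1 convolution per input stage, followed by one 3 x 3 convolution per stage that outputs
inner_channels / 4 channels) and ignores biases and the head; the function name dbfpn_conv_weights is hypothetical.

```python
def dbfpn_conv_weights(in_channels, inner_channels):
    """Count convolution weights in a DBFPN-style neck (biases ignored)."""
    # 1x1 lateral convolutions: each maps c input channels to inner_channels.
    lateral = sum(c * inner_channels for c in in_channels)
    # 3x3 output convolutions: each maps inner_channels to inner_channels // 4.
    smooth = len(in_channels) * inner_channels * (inner_channels // 4) * 3 * 3
    return lateral + smooth

stage_channels = [16, 24, 56, 480]
weights_96 = dbfpn_conv_weights(stage_channels, 96)    # 138240
weights_256 = dbfpn_conv_weights(stage_channels, 256)  # 737280
```

Under these assumptions the neck shrinks by more than a factor of five when inner_channels goes from 256 to 96, mostly
because the 3 x 3 convolutions grow quadratically with inner_channels.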


The structure of DBFPN and the fusion result of the backbone network feature map are printed below. 


# For detailed code implementation, refer to:
# https://github.com/PaddlePaddle/PaddleOCR/blob/release/2.4/ppocr/modeling/necks/db_fpn.py
from ppocr.modeling.necks.db_fpn import DBFPN

neck_bdfpn = DBFPN(in_channels=[16, 24, 56, 480], out_channels=96)
# Print DBFPN structure
print(neck_bdfpn)

# First reduce the number of original channels to 96, then to 24, and then concatenate the 4 feature maps
fpn_out = neck_bdfpn(bk_out)

print('the shape of output of DBFPN: ', fpn_out.shape)


DBFPN(
  (in2_conv): Conv2D(16, 96, kernel_size=[1, 1], data_format=NCHW)
  (in3_conv): Conv2D(24, 96, kernel_size=[1, 1], data_format=NCHW)
  (in4_conv): Conv2D(56, 96, kernel_size=[1, 1], data_format=NCHW)
  (in5_conv): Conv2D(480, 96, kernel_size=[1, 1], data_format=NCHW)
  (p5_conv): Conv2D(96, 24, kernel_size=[3, 3], padding=1, data_format=NCHW)
  (p4_conv): Conv2D(96, 24, kernel_size=[3, 3], padding=1, data_format=NCHW)
  (p3_conv): Conv2D(96, 24, kernel_size=[3, 3], padding=1, data_format=NCHW)
  (p2_conv): Conv2D(96, 24, kernel_size=[3, 3], padding=1, data_format=NCHW)
)


the shape of output of DBFPN:  [1, 96, 160, 160]


SE Module Analysis in Backbone Network 


SE is the abbreviation of squeeze-and-excitation (Hu, Shen, and Sun 2018). 


Figure 7: Schematic of SE module


The SE module explicitly models the interdependence between channels and adaptively recalibrates the channel-wise
feature responses. Using SE modules in a network can significantly improve the accuracy of vision tasks, so the search
space of MobileNetV3 contains SE modules, and so does the resulting MobileNetV3 architecture. However, when the input
resolution is large, such as 640 x 640, it is difficult to estimate the channel-wise responses with the SE module: the
accuracy improvement is limited, and the module is very time-consuming. In DBNet, we remove the SE module from
the backbone network, which reduces the model size from 4.1M to 2.6M while the accuracy remains the same.
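
For reference, the computation inside an SE block can be sketched in NumPy as follows: squeeze each channel to one
value by global average pooling, pass the result through two small fully connected layers, and use the output as
per-channel gates. This is a minimal sketch of the original squeeze-and-excitation design with a plain sigmoid gate,
not the exact layer used in PaddleOCR; the function name se_block and the weight shapes are assumptions.

```python
import numpy as np

def se_block(x, w1, b1, w2, b2):
    """Squeeze-and-excitation on x of shape (N, C, H, W).

    w1 has shape (C, C // r) and w2 has shape (C // r, C), where r is the reduction ratio.
    """
    squeezed = x.mean(axis=(2, 3))                       # squeeze: (N, C)
    hidden = np.maximum(squeezed @ w1 + b1, 0.0)         # FC + ReLU: (N, C // r)
    gates = 1.0 / (1.0 + np.exp(-(hidden @ w2 + b2)))    # FC + sigmoid: (N, C)
    return x * gates[:, :, None, None]                   # rescale every channel

rng = np.random.default_rng(0)
features = rng.standard_normal((2, 8, 5, 5))
w1, b1 = rng.standard_normal((8, 2)), np.zeros(2)
w2, b2 = rng.standard_normal((2, 8)), np.zeros(8)
recalibrated = se_block(features, w1, b1, w2, b2)
```

The output has the same shape as the input. Each SE block adds two fully connected layers, which explains the extra
parameters and latency that are saved when SE is disabled.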


In PaddleOCR, you can remove the SE modules in the backbone network by setting disable_se=True as shown 
below. 


# For detailed code implementation, refer to:
# https://github.com/PaddlePaddle/PaddleOCR/blob/release/2.4/ppocr/modeling/backbones/det_mobilenet_v3.py
import paddle
from ppocr.modeling.backbones.det_mobilenet_v3 import MobileNetV3

x = paddle.rand([1, 3, 640, 640])

# Define the model with the SE modules disabled
backbone_mv3 = MobileNetV3(scale=0.5, model_name='large', disable_se=True)

# Model forward
bk_out = backbone_mv3(x)
# Output
for i, stage_out in enumerate(bk_out):
    print("the shape of ", i, 'stage: ', stage_out.shape)


Learning Rate Strategy Optimization


• Cosine Learning Rate Decay Strategy


In the gradient descent algorithm, we need to set a value that controls the amplitude of weight updates. This value is
called the learning rate, a hyperparameter that controls the learning speed of the model. The smaller the learning rate
is, the slower the loss function changes. A lower learning rate helps ensure that no local minimum is missed, but the
model converges slowly.


In the early stage of training, the weights are randomly initialized, so we can set a relatively large learning
rate to speed up convergence. In the later stage, the weights are close to their optimal values, and a small learning
rate prevents the model from oscillating during convergence.


This is the motivation for the cosine learning rate strategy, in which the learning rate follows a cosine curve during
training. Throughout training, the cosine learning rate decay strategy keeps a relatively large learning rate in the
initial stage and then gradually decays it to 0. Its convergence is relatively slow, but the final accuracy is good.
The figure below compares two learning rate decay strategies: piecewise decay and cosine decay.


Figure 8: Cosine and piecewise learning rate decay strategies (learning rate from 0.10 down to 0.00 over 120 epochs)
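
The two schedules can be sketched with a few lines of Python. This is a minimal illustration of the standard formulas,
not the PaddleOCR scheduler; the function names and the piecewise boundaries (epochs 30, 60 and 90) are hypothetical.

```python
import math

def cosine_decay(base_lr, epoch, total_epochs):
    """Cosine decay: starts at base_lr and reaches 0 at total_epochs."""
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * epoch / total_epochs))

def piecewise_decay(base_lr, epoch, boundaries, gamma=0.1):
    """Piecewise decay: multiply the learning rate by gamma at every boundary passed."""
    drops = sum(1 for b in boundaries if epoch >= b)
    return base_lr * gamma ** drops

cosine_curve = [cosine_decay(0.1, e, 120) for e in range(121)]
piecewise_curve = [piecewise_decay(0.1, e, [30, 60, 90]) for e in range(121)]
```

The cosine curve changes smoothly at every epoch, whereas the piecewise curve stays flat and then drops abruptly at
each boundary.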


• Learning Rate Warm-up Strategy


When the model is first trained, the model weights are initialized randomly. At this time, if a large learning rate is
selected, the model training may be unstable. Therefore, the concept of learning rate warm-up is proposed to solve the
problem of non-convergence in the early stage of model training.


Learning rate warm-up refers to gradually increasing the learning rate from a small value to a larger one at the beginning 
of training. It can ensure the stability of the model at the beginning of training. The strategy can improve the accuracy 
of image classification. Experiments show that this strategy is also effective in DBNet. When the learning rate warm-up 
strategy is combined with the cosine learning rate, the changing trend of the learning rate is shown in the following code. 


# For detailed code implementation, refer to:
# https://github.com/PaddlePaddle/PaddleOCR/blob/release/2.4/ppocr/optimizer/__init__.py
# Import the function that builds the learning rate scheduler
from ppocr.optimizer import build_lr_scheduler
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

# Let's see the effect when warmup_epoch is 2
lr_config = {'name': 'Cosine', 'learning_rate': 0.1, 'warmup_epoch': 2}
epochs = 20  # config['Global']['epoch_num']
iters_epoch = 100  # len(train_dataloader)
lr_scheduler = build_lr_scheduler(lr_config, epochs, iters_epoch)

iters = 0
lr = []
for epoch in range(epochs):
    for _ in range(iters_epoch):
        # See https://github.com/PaddlePaddle/PaddleOCR/blob/release/2.4/tools/program.py#L262
        lr_scheduler.step()
        iters += 1
        lr.append(lr_scheduler.get_lr())

x = np.arange(iters, dtype=np.int64)
y = np.array(lr, dtype=np.float64)

plt.figure(figsize=(15, 6))
plt.plot(x, y, color='red', label='lr')
plt.title(u'Cosine lr scheduler with Warmup')
plt.xlabel(u'iters')
plt.ylabel(u'lr')
plt.legend()
plt.show()


[Output plot: Cosine lr scheduler with Warmup (lr versus iters)]


Model Tailoring Strategy-FPGM
There are many redundant parameters in deep learning models. We can use some methods to remove redundant parts 
and improve the efficiency of model inference. 


Model tailoring (pruning) refers to obtaining a more lightweight network by removing redundant channels, filters,
neurons, and so on, while maintaining the accuracy of the model.


Compared with tailoring the channel or the feature map, tailoring the filter can get a more regular model, thus reducing 
the memory consumption and accelerating the model inference. 


Previous tailoring methods are mostly norm-based. That is, a filter with a smaller norm is considered less
important. However, this kind of method requires the minimum norm of the existing filters to be close to 0; otherwise
it is difficult to decide which filters can be removed.


In this case, Filter Pruning via Geometric Median (FPGM) is proposed. FPGM treats each filter in the convolutional 
layer as one point in Euclidean space, and it introduces the concept of geometric median, which refers to the point with 
the smallest sum of distances from all sampling points. If a filter is close to this geometric median, then it is probable that 
the information of this filter overlaps with other filters and can be removed. 
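
The selection rule can be sketched as below: for every filter, sum its Euclidean distances to all other filters in the
layer, and mark the filters with the smallest sums (those closest to the geometric median) for removal. This is a
minimal NumPy illustration of the idea, not the PaddleSlim implementation; the function name fpgm_select_filters and
the toy weights are hypothetical.

```python
import numpy as np

def fpgm_select_filters(weight, prune_ratio):
    """Return sorted indices of the filters to prune in one convolutional layer.

    weight has shape (out_channels, in_channels, kh, kw).
    """
    flat = weight.reshape(weight.shape[0], -1)
    diffs = flat[:, None, :] - flat[None, :, :]
    # Sum of Euclidean distances from each filter to all the others.
    dist_sum = np.sqrt((diffs ** 2).sum(axis=2)).sum(axis=1)
    n_prune = int(weight.shape[0] * prune_ratio)
    return np.sort(np.argsort(dist_sum)[:n_prune])

# Five toy 1x1x1 filters with values 0, 1, 2, 3 and 10.
toy_weight = np.array([0.0, 1.0, 2.0, 3.0, 10.0]).reshape(5, 1, 1, 1)
to_prune = fpgm_select_filters(toy_weight, prune_ratio=0.25)
```

In this toy layer the filter with value 2 is selected, because it sits in the middle of the others and is therefore the
most replaceable. A norm-based criterion would instead pick the filter with value 0.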


The comparison between FPGM and norm-based tailoring algorithm is shown in the figure below. 


Figure 9: Schematic of FPGM tailoring


In PP-OCR, we use FPGM to tailor the detection model. In the end, the accuracy of DBNet decreases only slightly,
while the model size is reduced by 46% and the prediction speed is accelerated by 19%.


For more details on the implementation of FPGM model tailoring, please refer to PaddleSlim.


Notice:


1. Model tailoring requires retraining the model. For this, please refer to the PaddleOCR Pruning Tutorial.


2. The tailoring code is adapted for DBNet. If you need to prune your own model, re-analyze the model structure and
the sensitivity of its parameters. It is usually recommended to cut only the parameters with relatively low
sensitivity and to skip the sensitive ones.


3. The tailoring rate of each convolutional layer is also very important for the performance of the tailored model.
Using an identical tailoring rate for all layers usually results in significant performance degradation.


4. Model tailoring cannot be accomplished overnight. Only through repeated experiments can you obtain a tailored model
that meets the requirements.


Description of Text Detection Configuration 


A brief description of the training configuration of DBNet is given below. For the complete configuration file, refer
to: ch_det_mv3_db_v2.0.yml


Architecture:                   # Define model structure
  model_type: det
  algorithm: DB
  Transform:
  Backbone:
    name: MobileNetV3           # Configure backbone network
    scale: 0.5
    model_name: large
    disable_se: True            # Remove SE modules
  Neck:
    name: DBFPN                 # Configure DBFPN
    out_channels: 96            # Configure inner_channels
  Head:
    name: DBHead
    k: 50

Optimizer:
  name: Adam
  beta1: 0.9
  beta2: 0.999
  lr:
    name: Cosine                # Configure cosine learning rate decay strategy
    learning_rate: 0.001        # Initial learning rate
    warmup_epoch: 2             # Configure learning rate warm-up strategy
  regularizer:
    name: 'L2'                  # Configure L2 regularizer
    factor: 0                   # The weight of the regularizer


Summary of PP-OCR Detection Optimization 


The previous parts introduced the optimization strategies of the text detection algorithm in PP-OCR. This part reviews
the ablation experiments and conclusions corresponding to the different optimization strategies.


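inner_channel of the head | Remove SE | Cosine Learning Rate Decay | Learning Rate Warm-up | Precision | Recall | HMean | Model Size (M) | Inference Time (CPU, ms)
256 | - | - | - | 0.6821 | 0.5560 | 0.6127 | 7 | 406
96 | - | - | - | 0.6677 | 0.5524 | 0.6046 | 4.1 | 213
96 | ✓ | - | - | 0.6952 | 0.5413 | 0.6087 | 2.6 | 173
96 | ✓ | ✓ | - | 0.7034 | 0.5404 | 0.6112 | 2.6 | 173
96 | ✓ | ✓ | ✓ | 0.7349 | 0.5420 | 0.6239 | 2.6 | 173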


Figure 10: DBNet Ablation Experiment 


Through strategies such as the lightweight backbone network, the lightweight neck structure, SE module analysis and
removal, learning rate adjustment, and model tailoring, the model size of DBNet is reduced from 7M to 1.5M. Its
accuracy is increased by more than 1% through improved training strategies such as the learning rate strategy.


In PP-OCR, the ultra-lightweight DBNet detection result is as follows:


Figure 11: The ultra-lightweight DBNet detection result


The following shows a quick prediction with the text detection model. The detailed prediction and inference
code will be explained in Chapter 5.


!mkdir inference

# Download and extract the PP-OCRv2 detection inference model
!cd inference && wget https://paddleocr.bj.bcebos.com/PP-OCRv2/chinese/ch_PP-OCRv2_det_infer.tar -O ch_PP-OCRv2_det_infer.tar && tar -xf ch_PP-OCRv2_det_infer.tar

# Run text detection on one image
!python tools/infer/predict_det.py --image_dir="./doc/imgs/00111002.jpg" --det_model_dir="./inference/ch_PP-OCRv2_det_infer" --use_gpu=False

import matplotlib.pyplot as plt
from PIL import Image

# Show the visualized detection result
img_det = Image.open('./inference_results/det_res_00111002.jpg')
plt.figure(figsize=(14, 10))  # window size
plt.imshow(img_det)
plt.title('Detection')
plt.show()
mkdir: cannot create directory ‘inference’: File exists
--2021-12-24 21:07:17--  https://paddleocr.bj.bcebos.com/PP-OCRv2/chinese/ch_PP-OCRv2_det_infer.tar
Resolving paddleocr.bj.bcebos.com (paddleocr.bj.bcebos.com)... 182.61.200.229, 182.61.200.195, 2409:8c04:1001:1002:0:ff:b001:368a
Connecting to paddleocr.bj.bcebos.com (paddleocr.bj.bcebos.com)|182.61.200.229|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3190272 (3.0M) [application/x-tar]
Saving to: ‘ch_PP-OCRv2_det_infer.tar’

ch_PP-OCRv2_det_inf 100%[===================>]   3.04M  4.13MB/s    in 0.7s
2021-12-24 21:07:18 (4.13 MB/s) - ‘ch_PP-OCRv2_det_infer.tar’ saved [3190272/3190272]

[2021/12/24 21:07:22] root INFO: The predict time of ./doc/imgs/00111002.jpg: 1.…
[2021/12/24 21:07:22] root INFO: The visualized image saved in ./inference_results/det_res_00111002.jpg
6.2.2 Direction Classifier


The aim of the direction classifier is to classify the direction of each text instance found by the text detector, rotate
the text to 0 degrees, and then send it to the text recognizer. In PP-OCR, we consider two directions: 0 degrees and
180 degrees. The following is a detailed introduction to the speed and accuracy optimization strategies for the direction
classifier.
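The rotate-then-recognize step described above can be sketched in a few lines. This is only an illustration, not the PaddleOCR implementation: the function names, the nested-list image representation, and the `cls_thresh` default of 0.9 are assumptions made for this example.

```python
def rotate_180(image):
    """Rotate a 2-D image, given as a list of pixel rows, by 180 degrees."""
    return [row[::-1] for row in image[::-1]]


def correct_direction(image, label, score, cls_thresh=0.9):
    """Rotate the crop back to 0 degrees only for a confident '180' prediction.

    cls_thresh is a hypothetical confidence threshold chosen for this sketch.
    """
    if label == "180" and score > cls_thresh:
        return rotate_180(image)
    return image


crop = [[1, 2, 3],
        [4, 5, 6]]
upright = correct_direction(crop, "180", 0.99)    # confident: rotated back
unchanged = correct_direction(crop, "180", 0.60)  # not confident: kept as is
print(upright, unchanged)
```

Keeping low-confidence crops unrotated is a conservative choice: a wrong rotation would hand the recognizer an upside-down line.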


Figure 12: Directional classifier ablation experiment 


Lightweight Backbone Network 


As with the text detector, we use MobileNetV3 as the backbone network of the direction classifier. Because the
task of direction classification is simpler, we use MobileNetV3 small 0.35x to balance model accuracy and prediction
efficiency. Experiments show that using a larger backbone does not further improve the accuracy.


[Chart: Performance of the different backbones for direction classification. Backbones: MobileNetV3_small_x0.5,
MobileNetV3_small_x0.35, ShuffleNetV2_x0.5. Series: Model Size (M), Accuracy, Inference Time (CPU, ms).]


Figure 13: Comparison of the accuracy of direction classifiers with different backbone networks 


Data Augmentation


Data augmentation refers to transforming the image before sending it to the network for training, which can improve the
generalization performance of the network. Commonly used data augmentation methods include rotation, perspective
distortion and transformation, motion blur, and Gaussian noise. In PP-OCR, these data augmentation
methods are grouped as BDA (Base Data Augmentation). It is verified that BDA can significantly improve the
accuracy of the direction classifier.


The following shows the effect of some BDA data augmentation methods. 


[Image panels: Original image, Transparency, Rotation]


Figure 14: BDA data augmentation effect 
Besides BDA, we also introduce some higher-level data augmentation methods to improve classification, such as
AutoAugment (Cubuk et al. 2019), RandAugment (Cubuk et al. 2020), CutOut (DeVries and Taylor 2017), RandErasing
(Zhong et al. 2020), HideAndSeek (Singh and Lee 2017), GridMask (Chen 2020), Mixup (Zhang et al. 2017) and
Cutmix (Yun et al. 2019).


These data augmentations are divided into 3 categories: 

(1) Image transformation: AutoAugment, RandAugment 

(2) Image tailoring: CutOut, RandErasing, HideAndSeek, GridMask 
(3) Image mixing: Mixup, Cutmix 
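To make the difference between the tailoring and mixing categories concrete, here is a minimal sketch of one method from each, written for grayscale images stored as nested lists. The function names and parameters are illustrative assumptions and do not mirror the PaddleOCR operators.

```python
def cutout(image, top, left, size, fill=0):
    """Image tailoring: mask a size x size square with a constant value."""
    out = [row[:] for row in image]  # copy so the input stays untouched
    for r in range(top, min(top + size, len(out))):
        for c in range(left, min(left + size, len(out[0]))):
            out[r][c] = fill
    return out


def mixup(image_a, image_b, lam):
    """Image mixing: pixel-wise convex combination of two same-size images."""
    return [[lam * a + (1 - lam) * b for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(image_a, image_b)]


img_a = [[10, 10], [10, 10]]
img_b = [[30, 30], [30, 30]]
masked = cutout(img_a, top=0, left=0, size=1)
mixed = mixup(img_a, img_b, lam=0.5)
print(masked, mixed)
```

In a classification setting, Mixup usually blends the labels with the same coefficient `lam`; that step is omitted here.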


The visual comparison results of different high-level data augmentation are given: 


[Image panels]
Original image
Image Transformation: (0) Standard Transformation, (1) AutoAugment, (2) RandAugment
Image Tailoring: (3) CutOut, (4) RandErasing, (5) HideAndSeek, (6) GridMask
Image Mixing: (7) Mixup, (8) Cutmix


Figure 15: High-level data augmentation visualization effect


However, experiments show that, except for RandAugment and RandErasing, most methods are not suitable for the direction
classifier. The figure below shows how model accuracy changes with different data augmentation strategies.


Ablation study of data augmentation for direction classification

Data augmentation strategy | Accuracy
BDA+RandAugment            | 0.9212
BDA+AutoAugment            | 0.9133
BDA+RandomErasing          | 0.9193
BDA+GridMask               | 0.914
BDA+HideAndSeek            | 0.8598
BDA+Cutout                 | 0.9081
BDA+Mixup                  | 0.9104
BDA+CutMix                 | 0.9083
BDA                        | 0.9134
No augmentation            | 0.8879


Figure 16: Ablation study of data augmentation for direction classification


Finally, we combine BDA and RandAugment as the data augmentation strategy for training the direction classifier.

• RandAugment code demo


# Reference Code:
# https://github.com/PaddlePaddle/PaddleOCR/blob/release/2.4/ppocr/data/imaug/__init__.py
import random

import matplotlib.pyplot as plt
import numpy as np
from PIL import Image

from ppocr.data.imaug import DecodeImage, RandAugment, transform

# Fix the random seeds so that the augmentation is reproducible
np.random.seed(1)
random.seed(1)

img = Image.open('./doc/imgs_words/ch/word_4.jpg')

# Draw the original image
plt.figure("Image1")  # Image window name
plt.imshow(img)
plt.title('Before RandAugment')  # image title
plt.show()

# Read the raw image bytes; DecodeImage expects them under the 'image' key
data = {'image': None}
with open('./doc/imgs_words/ch/word_4.jpg', 'rb') as f:
    img = f.read()
    data['image'] = img

# Define transformation operators
ops_list = [DecodeImage(), RandAugment()]

# Data transformation
data = transform(data, ops_list)
img_auged = data['image']

# Show the augmented image
img_auged = Image.fromarray(img_auged, 'RGB')
plt.figure("Image")  # Image window name
plt.imshow(img_auged)
plt.title('After RandAugment')  # image title
plt.show()


[Output images: Before RandAugment, After RandAugment]

The following shows a quick prediction with the direction classifier model. The detailed prediction and inference
code will be explained in Chapter 5.


# Reference Code:
# https://github.com/PaddlePaddle/PaddleOCR/blob/release/2.4/tools/infer/predict_cls.py

# Download and extract the direction classifier inference model
!cd inference && wget https://paddleocr.bj.bcebos.com/dygraph_v2.0/ch/ch_ppocr_mobile_v2.0_cls_infer.tar -O ch_ppocr_mobile_v2.0_cls_infer.tar && tar -xf ch_ppocr_mobile_v2.0_cls_infer.tar

# Classify the original image with the direction classifier
!python tools/infer/predict_cls.py --image_dir="./doc/imgs_words/ch/word_1.jpg" --cls_model_dir="./inference/ch_ppocr_mobile_v2.0_cls_infer" --use_gpu=False

# Import image
import cv2
import matplotlib.pyplot as plt

img = cv2.imread("./doc/imgs_words/ch/word_1.jpg")
plt.imshow(img[:, :, ::-1])  # OpenCV loads BGR; reverse the channels for display
plt.show()

# Rotate 180 degrees (two 90-degree clockwise rotations)
img = cv2.rotate(img, cv2.ROTATE_90_CLOCKWISE)
img = cv2.rotate(img, cv2.ROTATE_90_CLOCKWISE)
cv2.imwrite("./test.png", img)

# Use the direction classifier to classify the rotated image
!python tools/infer/predict_cls.py --image_dir="./test.png" --cls_model_dir="./inference/ch_ppocr_mobile_v2.0_cls_infer" --use_gpu=False

plt.imshow(img[:, :, ::-1])
plt.show()
--2021-12-24 21:19:04--  https://paddleocr.bj.bcebos.com/dygraph_v2.0/ch/ch_ppocr_mobile_v2.0_cls_infer.tar
Resolving paddleocr.bj.bcebos.com (paddleocr.bj.bcebos.com)... 182.61.200.195, 182.61.200.229, 2409:8c04:1001:1002:0:ff:b001:368a
Connecting to paddleocr.bj.bcebos.com (paddleocr.bj.bcebos.com)|182.61.200.195|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1454080 (1.4M) [application/x-tar]
Saving to: ‘ch_ppocr_mobile_v2.0_cls_infer.tar’

ch_ppocr_mobile_v2. 100%[===================>]   1.39M  --.-KB/s    in 0.1s
2021-12-24 21:19:04 (14.3 MB/s) - ‘ch_ppocr_mobile_v2.0_cls_infer.tar’ saved [1454080/1454080]

[2021/12/24 21:19:06] root INFO: Predicts of ./doc/imgs_words/ch/word_1.jpg:['0', 0.9998784]

[Output image: word_1.jpg]

[2021/12/24 21:19:09] root INFO: Predicts of ./test.png:['180', 0.9999759]

[Output image: word_1.jpg rotated by 180 degrees]


Input Resolution Optimization 


Generally speaking, accuracy increases with the input resolution of the image. Since the backbone network of the
direction classifier has very few parameters, the inference time does not increase significantly even if the resolution
increases. When the input image scale of the direction classifier is enlarged from 3x32x100 to 3x48x192, the accuracy
of the direction classifier grows from 92.1% to 94.0%, while the prediction time only increases from 3.19 ms to
3.21 ms.


Here is a comparison of the image sizes at the two scales. 




Figure 17: Image size comparison between 32x100 and 48x192 
Model Quantization Strategy-PACT


Model quantization converts floating-point calculations into low-bit fixed-point calculations, which gives neural
network models lower latency, smaller size, and lower computational cost.


Model quantization is mainly divided into offline quantization and online quantization. Offline quantization refers to 
a fixed-point quantization method that uses techniques such as KL divergence to determine quantization parameters, 
and does not require retraining after quantization; online quantization refers to the method that determines quantization 
parameters during the training process. Compared with offline quantization, its accuracy loss is smaller. 
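As a rough illustration of what a fixed-point conversion does, the sketch below quantizes a few weights to 8-bit integers and maps them back. It uses a simple abs-max scale, not the KL-divergence calibration mentioned above, and the function names are assumptions made for this example.

```python
def quantize_abs_max(values, num_bits=8):
    """Map floats to signed integers using a single abs-max scale."""
    qmax = 2 ** (num_bits - 1) - 1           # 127 for 8 bits
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    quantized = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate floats from the integer representation."""
    return [q * scale for q in quantized]


weights = [0.5, -1.27, 0.0, 1.27]
q, scale = quantize_abs_max(weights)
restored = dequantize(q, scale)
print(q, restored)
```

The rounding error of each value is at most half a quantization step (`scale / 2`). A single extreme value inflates `scale` and therefore the error of every other value, which is why removing outliers before choosing the scale helps.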


PACT (PArameterized Clipping acTivation) is a new online quantization method that removes some extreme values
from the activations in advance. After removal, the model can learn more appropriate quantization parameters.
In the ordinary PACT method, the preprocessing of the activation values is based on the ReLU function, and the formula
is:

$$y = PACT(x) = 0.5\,(|x| - |x - \alpha| + \alpha) = \begin{cases} 0, & x \in (-\infty, 0) \\ x, & x \in [0, \alpha) \\ \alpha, & x \in [\alpha, +\infty) \end{cases}$$


All activation values greater than a certain threshold will be reset to a constant. However, the activation function in 
MobileNetV3 is not only ReLU, but also hardswish. Therefore, ordinary PACT quantization will lead to higher accuracy 
loss. To reduce the quantization loss, the formula of the activation function is modified to: 


$$y = PACT(x) = \begin{cases} -\alpha, & x \in (-\infty, -\alpha) \\ x, & x \in [-\alpha, \alpha) \\ \alpha, & x \in [\alpha, +\infty) \end{cases}$$
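The two clipping functions can be compared numerically with a short sketch. The function names are chosen for this example, and the threshold is passed in as a plain number, whereas in PACT it is a learned parameter.

```python
def pact_relu(x, alpha):
    """ReLU-based PACT: clip activations to the range [0, alpha]."""
    return 0.5 * (abs(x) - abs(x - alpha) + alpha)


def pact_symmetric(x, alpha):
    """Modified PACT: clip activations to the range [-alpha, alpha]."""
    return max(-alpha, min(alpha, x))


for x in (-10.0, -1.0, 3.0, 10.0):
    print(x, pact_relu(x, 6.0), pact_symmetric(x, 6.0))
```

The ReLU-based form maps every negative activation to 0, while the symmetric form keeps negative values down to the lower threshold. That matters for hardswish, whose output can be negative.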


PaddleOCR provides quantization scripts for the PP-OCR suite. For details, please refer to the PaddleOCR Model
Quantization Tutorial.


Direction Classifier Configuration Instruction 


Some key fields of the configuration file used to train the direction classifier are explained below. For the complete
configuration file, refer to cls_mv3.yml.


Architecture:
  model_type: cls
  algorithm: CLS
  Transform:
  Backbone:
    name: MobileNetV3  # Configure the classification model as MobileNetV3
    scale: 0.35
    model_name: small
  Neck:
  Head:
    name: ClsHead
    class_dim: 2

Train:
  dataset:
    name: SimpleDataSet
    data_dir: ./train_data/cls
    label_file_list:
      - ./train_data/cls/train.txt
    transforms:
      - DecodeImage:  # load image
          img_mode: BGR
          channel_first: False
      - ClsLabelEncode:  # Class handling label
      - RecAug:
          use_tia: False  # Configure BDA without using TIA data augmentation
      - RandAugment:  # Configure random augmentation
      - ClsResizeImg:
          image_shape: [3, 48, 192]  # Here, [3, 32, 100] is changed to [3, 48, 192] to optimize the input resolution
      - KeepKeys:
          keep_keys: ['image', 'label']  # dataloader will return list in this order
  loader:
    shuffle: True
    batch_size_per_card: 512
    drop_last: True
    num_workers: 8


Summary of the Direction Classifier Experiment 


When improving the direction classifier model, we use a lightweight backbone network and model quantization to
reduce the model size from 0.85M to 0.46M. In addition, the combination of data augmentation, higher input resolution,
and other strategies improves the model accuracy by more than 2%. The comparison in the ablation experiment is shown below.


Input Resolution | PACT Quantization | Accuracy | Model Size (M) | Inference Time (SD 855, ms)
3 x 32 x 100     |                   |          |                | 3.19
3 x 48 x 192     |                   |          |                |
3 x 48 x 192     |                   |          |                |


Figure 18: Direction classifier ablation experiment 


6.2.3 Text Recognition


The text recognizer of PP-OCR is the CRNN model, which uses CTC loss to handle the prediction of text with variable
length.
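One way to see how CTC copes with variable-length text is to look at decoding: the network emits one class per time step, and the final string is obtained by merging repeated classes and dropping a special blank class. The sketch below shows greedy (best-path) decoding only. The blank index of 0 and the function name are assumptions for this example.

```python
def ctc_greedy_decode(frame_ids, blank=0):
    """Collapse consecutive repeats, then remove blanks."""
    decoded, prev = [], None
    for idx in frame_ids:
        if idx != prev and idx != blank:
            decoded.append(idx)
        prev = idx
    return decoded


# Eight time steps decode to three characters; the blank between the two 3s
# is what allows a genuinely repeated character to survive.
frames = [0, 3, 3, 0, 3, 5, 5, 0]
print(ctc_greedy_decode(frames))
```

Because the output length is decided by the collapse rule rather than fixed in advance, the same network can read both short and long text lines.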


The structure of the CRNN model is shown below. 
Figure 19: CRNN structure diagram


For the text recognizer, PP-OCR optimizes the model in terms of the backbone network, head structure,
data augmentation, regularization, feature map downsampling strategy, and quantization. The ablation experiment is
shown below.

Figure 20: Ablation experiment of the CRNN recognition model


The optimization strategies of the text recognition model are described in detail below. 


Lightweight Backbone Network and Head Structure 


• Lightweight backbone network


In text recognition, MobileNetV3 is still used as the backbone, the same as text detection. MobileNetV3_small_x0.5 
is selected to further balance accuracy and efficiency. If there is no requirement for the model size, you can choose 
MobileNetV3_small_x1 which can significantly improve the accuracy with the model size only increased by 5M. 


Backbone                | Accuracy | Model Size (M) | Inference Time (CPU, ms)
MobileNetV3_small_x0.35 |          |                |
MobileNetV3_small_x0.5  | 0.6556   | 23             | 17.27
MobileNetV3_small_x1    | 0.6933   | 28             | 19.15


Figure 21: Comparison of recognition model accuracy under different backbone networks 


• Lightweight head structure


In CRNN, the head used for decoding is a fully connected layer that decodes sequence features into the predicted
characters. The dimension of the sequence feature has a great influence on the model size of the text recognizer,
especially in scenarios that require recognizing more than 6,000 Chinese characters (if the sequence feature dimension
is set to 256, the model size of the head alone is 6.7M). In PP-OCR, we conducted experiments on the dimension of the
sequence features and set it to 48 to balance accuracy and efficiency. The conclusions of some
ablation experiments are as follows.
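A back-of-the-envelope calculation shows why the sequence feature dimension dominates the head size. The sketch below assumes a single fully connected layer with float32 parameters and a hypothetical dictionary of 6,625 classes; both are assumptions for illustration, so the result only approximates the figure quoted above.

```python
def fc_head_size_mb(feature_dim, num_classes, bytes_per_param=4):
    """Approximate storage of one fully connected layer in megabytes."""
    params = feature_dim * num_classes + num_classes  # weights + biases
    return params * bytes_per_param / 1e6


NUM_CLASSES = 6625  # hypothetical dictionary size for this sketch
for dim in (256, 96, 64, 48):
    print(dim, round(fc_head_size_mb(dim, NUM_CLASSES), 2))
```

The size grows linearly with the feature dimension, so going from 256 to 48 shrinks this layer by roughly a factor of five.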
Number of channels | Accuracy | Model Size (M) | Inference Time (CPU, ms)
256                | 0.6556   |                | 17.27
96                 | 0.6673   | 8              | 13.36
64                 | 0.6642   |                | 12.64
48                 | 0.6581   |                | 12.26


Figure 22: Accuracy comparison for different sequence feature dimensions


Data Augmentation 


Besides the aforementioned BDA (Base Data Augmentation), which is often used in text recognition, TIA (Luo et al.,
2020) is another effective data augmentation method for text recognition. TIA is a data augmentation method for
scene text. It sets multiple reference points in the image and then randomly moves the points to generate new images
through geometric transformation, which can greatly improve data diversity and the generalization ability of the model.
The basic flow chart of TIA is shown in the figure:


[Flow chart: input image → move control points following a certain distribution → augmented image]


Figure 23: The basic flow chart of TIA 


Experiments show that the use of TIA data augmentation can improve the accuracy of the text recognition model by 0.9% 
on a very high baseline. 


Here is the visualization of the three types of data augmentation involved in TIA. 


# Reference Code:
# https://github.com/PaddlePaddle/PaddleOCR/blob/release/2.4/ppocr/data/imaug/text_image_aug/augment.py
import cv2
import matplotlib.pyplot as plt

from ppocr.data.imaug.rec_img_aug import tia_distort, tia_stretch, tia_perspective

img = cv2.imread("./doc/imgs_words/ch/word_1.jpg")

# Apply the three TIA transformations
img_out1 = tia_distort(img, 3)
img_out2 = tia_stretch(img, 3)
img_out3 = tia_perspective(img)

# Show the original image and the three results side by side (BGR -> RGB for display)
plt.figure(figsize=(20, 8))
plt.subplot(1, 4, 1)
plt.imshow(img[:, :, ::-1])
plt.subplot(1, 4, 2)
plt.imshow(img_out1[:, :, ::-1])
plt.subplot(1, 4, 3)
plt.imshow(img_out2[:, :, ::-1])
plt.subplot(1, 4, 4)
plt.imshow(img_out3[:, :, ::-1])
plt.show()

[Output images: the original image followed by the tia_distort, tia_stretch, and tia_perspective results]
Learning Rate Strategy and Regularization


In the training of the recognition model, the learning rate schedule is the same as for text detection: cosine learning
rate decay combined with learning rate warm-up (Cosine+Warmup).
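The shape of a Cosine+Warmup schedule can be sketched as a plain function of the training step. This is an illustrative version with a linear warm-up, not the Paddle scheduler itself: the function name and the step-based (rather than epoch-based) parameters are assumptions.

```python
import math


def cosine_warmup_lr(step, total_steps, warmup_steps, base_lr=0.001):
    """Linear warm-up to base_lr, then cosine decay toward zero."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1 + math.cos(math.pi * progress))


schedule = [cosine_warmup_lr(s, total_steps=100, warmup_steps=5) for s in range(101)]
print(schedule[0], schedule[5], schedule[50], schedule[100])
```

The small initial learning rate keeps early updates stable while the weights are still close to their initial values, and the cosine phase lowers the rate smoothly instead of in abrupt steps.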


Regularization is widely used to avoid overfitting, and it includes L1 regularization and L2 regularization. In most usage
scenarios, we use L2 regularization, which calculates the L2 norm of the weights in the network and adds it to the loss
function. With L2 regularization, the weights of the network tend to be smaller and the parameters across the network
are pushed toward 0, thereby alleviating overfitting and improving generalization.


Our experiments have found that L2 regularization has a great influence on the recognition accuracy in text recognition. 
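The effect of the L2 term can be seen in a tiny numerical sketch. It assumes the common convention of adding half the squared L2 norm times the regularization factor to the loss; the exact scaling differs between frameworks, and the function names are made up for this example.

```python
def l2_penalty(weights, factor):
    """Regularization term added to the loss: 0.5 * factor * sum(w^2)."""
    return 0.5 * factor * sum(w * w for w in weights)


def sgd_step_with_l2(weights, grads, lr, factor):
    """One SGD step; the penalty contributes factor * w to each gradient."""
    return [w - lr * (g + factor * w) for w, g in zip(weights, grads)]


weights = [1.0, -2.0]
print(l2_penalty(weights, factor=0.1))
# With zero data gradient, the penalty alone shrinks every weight toward 0.
print(sgd_step_with_l2(weights, grads=[0.0, 0.0], lr=0.1, factor=0.5))
```

Each step multiplies a weight by `1 - lr * factor` before the data gradient is applied, which is why this form of regularization is also called weight decay.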


Feature Map Downsampling Strategy 


In downstream vision tasks such as detection, segmentation, and OCR, the backbone network is usually taken from image
classification, where the input resolution is 224x224 and the width and height are downsampled at the same rate.


However, in text recognition the input image is generally 32x100, which has an extreme aspect ratio. In this case,
downsampling the width and the height simultaneously causes severe feature loss. Therefore, the feature map
downsampling must be adapted when a backbone network from image classification is applied to text recognition (if you
replace the backbone network yourself, you also need to pay attention to this).


In PaddleOCR, the height and width of the input image of the CRNN Chinese text recognition model are set to 32 and
320. The original MobileNetV3 comes from image classification. As mentioned above, the downsampling stride
needs to be adapted to the input resolution of the text image. Specifically, in order to retain more horizontal
information, the stride of each downsampling feature map is modified from (2,2) to (2,1), except for the first
downsampling. The final result is shown in the figure below.
Input      | Operator         | exp size | #out | SE | NL | s | Stride (Feature Map Resolution)
224² x 3   | conv2d, 3x3      | -        | 16   | -  | HS | 2 |
112² x 16  | bneck, 3x3       | 16       | 16   | ✓  | RE | 2 | (2,1)(8*160) → (1,1)(16*160)
56² x 16   | bneck, 3x3       | 72       | 24   | -  | RE | 2 | (2,1)(4*160) → (2,1)(8*160)
28² x 24   | bneck, 3x3       | 88       | 24   | -  | RE | 1 |
28² x 24   | bneck, 5x5       | 96       | 40   | ✓  | HS | 2 | (2,1)(2*160) → (2,1)(4*160)
14² x 40   | bneck, 5x5       | 240      | 40   | ✓  | HS | 1 |
14² x 40   | bneck, 5x5       | 240      | 40   | ✓  | HS | 1 |
14² x 40   | bneck, 5x5       | 120      | 48   | ✓  | HS | 1 |
14² x 48   | bneck, 5x5       | 144      | 48   | ✓  | HS | 1 |
14² x 48   | bneck, 5x5       | 288      | 96   | ✓  | HS | 2 | (2,1)(1*160) → (2,1)(2*160)
7² x 96    | bneck, 5x5       | 576      | 96   | ✓  | HS | 1 |
7² x 96    | bneck, 5x5       | 576      | 96   | ✓  | HS | 1 |
7² x 96    | conv2d, 1x1      | -        | 576  | ✓  | HS | 1 |
7² x 576   | pool, 7x7        | -        | -    | -  | -  | 1 | (2,2)(1*80) → (2,2)(1*80)
1² x 576   | conv2d 1x1, NBN  | -        | 1280 | -  | HS | 1 |
1² x 1280  | conv2d 1x1, NBN  | -        | k    | -  | -  | 1 | abandoned


Figure 25: visualization of downsampling step strategy optimization 


In order to retain more vertical information, the stride of the second downsampling feature map is further modified from
(2,1) to (1,1). The stride s2 of the second downsampling feature map significantly affects the resolution of
the entire feature map and the accuracy of the text recognizer. In PP-OCR, s2 is set to (1,1) to gain better performance.
At the same time, due to the increase in feature map resolution, the CPU inference time increases from 11.84 ms
to 12.96 ms.
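The effect of the stride change on intermediate resolutions can be traced with a small calculation. The sketch below lists only the stride-carrying layers before the final pooling and uses ceiling division to model padded convolutions; the function name and the way the layers are listed are assumptions for illustration.

```python
import math


def feature_map_sizes(height, width, strides):
    """Return the (height, width) after each stride-carrying layer."""
    sizes = []
    for stride_h, stride_w in strides:
        height = math.ceil(height / stride_h)
        width = math.ceil(width / stride_w)
        sizes.append((height, width))
    return sizes


first_conv = (2, 2)  # the first downsampling keeps stride (2, 2)
before = [first_conv, (2, 1), (2, 1), (2, 1), (2, 1)]
after = [first_conv, (1, 1), (2, 1), (2, 1), (2, 1)]  # s2 changed to (1, 1)

print(feature_map_sizes(32, 320, before))
print(feature_map_sizes(32, 320, after))
```

For a 32x320 input, every intermediate feature map after the second stage is twice as tall with the new setting, so later layers see more vertical detail at the price of more computation.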


The following compares the feature map scales before and after the stride optimization. Although the output
feature maps have the same scale, the feature information is better preserved during encoding after the stride is
modified from (2,1) to (1,1).


# Reference Code:
# https://github.com/PaddlePaddle/PaddleOCR/blob/release/2.4/ppocr/modeling/backbones/rec_mobilenet_v3.py
import paddle

from ppocr.modeling.backbones.rec_mobilenet_v3 import MobileNetV3

# Backbones before and after the stride optimization
mv3_ori = MobileNetV3(model_name="small", scale=0.5, ...)
mv3_new = MobileNetV3(model_name="small", scale=0.5, ...)

# Random input with the recognition input shape
x = paddle.rand([1, 3, 32, 320])

y_ori = mv3_ori(x)
y_new = mv3_new(x)

print(y_ori.shape)
print(y_new.shape)

[1, 288, 1, 80]
[1, 288, 1, 80]
PACT Online Quantization Strategy


We use a scheme similar to the quantization of the direction classifier to reduce the model size of the text recognizer.
Due to the complexity of LSTM quantization, the LSTM is not quantized in PP-OCR. With this quantization strategy,
the model size is reduced by 67.4%, the prediction speed increases by 8%, and the accuracy improves by 1.6%.
Quantization can reduce model redundancy and enhance the expression ability of the model.


PACT Quantization | Accuracy | Model Size (M) | Inference Time (SD 855, ms)
                  | 0.6581   |                |
                  |          |                |


Figure 26: Ablation experiment of model quantization


Text Recognition Pre-training Model


A suitable pre-trained model can speed up the convergence of the model. In real scenarios, the data available for text
recognition is usually limited. In PP-OCR, tens of millions of synthesized samples are used to train the model, which is
then fine-tuned on real data. As a result, the recognition accuracy increases from 65.81% to 69%.


Description of Text Recognition Configuration 


A brief description of the training configuration of CRNN is given below. For the complete configuration file, refer to: 
rec_chinese_lite_train_v2.0.yml. 


Optimizer:
  name: Adam
  beta1: 0.9
  beta2: 0.999
  lr:
    name: Cosine  # Configure Cosine learning rate reduction strategy
    learning_rate: 0.001
    warmup_epoch: 5  # Configure warmup learning rate
  regularizer:
    name: 'L2'  # Configure L2 regularization
    factor: 0.00001

Architecture:
  model_type: rec
  algorithm: CRNN
  Transform:
  Backbone:
    name: MobileNetV3  # Configure Backbone
    scale: 0.5
    model_name: small
    small_stride: [1, 2, 2, 2]  # Configure the stride for downsampling
  Neck:
    name: SequenceEncoder
    encoder_type: rnn
    hidden_size: 48  # Configure the dimension of the last fully connected layer
  Head:
    name: CTCHead
    fc_decay: 0.00001

Train:
  dataset:
    name: SimpleDataSet
    data_dir: ./train_data/
    label_file_list: ["./train_data/train_list.txt"]
    transforms:
      - DecodeImage:  # load image
          img_mode: BGR
          channel_first: False
      - RecAug:  # Configure BDA and TIA data augmentation; TIA is used by default
      - CTCLabelEncode:  # Class handling label
      - RecResizeImg:
          image_shape: [3, 32, 320]
      - KeepKeys:
          keep_keys: ['image', 'label', 'length']  # dataloader will return list in this order
  loader:
    shuffle: True
    batch_size_per_card: 256
    drop_last: True
    num_workers: 8


Summary of Recognition Optimization 


In terms of model size, PP-OCR reduces the model size from 4.5M to 1.6M through a lightweight backbone network,
sequence feature dimension reduction, and model quantization. In terms of accuracy, it gains
15.4% on the validation set by using optimization strategies such as TIA data augmentation, the Cosine+Warmup learning
rate strategy, L2 regularization, feature map resolution improvement, and a pre-trained model.


Some of the recognition results in PP-OCR are shown below. 
Figure 27: The recognition results in PP-OCR

The code demonstration of the text recognition model is as follows.


# Visualize the original image
import cv2
import matplotlib.pyplot as plt

img = cv2.imread("./doc/imgs_words/ch/word_4.jpg")
plt.imshow(img[:, :, ::-1])  # OpenCV loads BGR; reverse the channels for display
plt.show()

# Download and extract the PP-OCRv2 recognition inference model
!cd inference && wget https://paddleocr.bj.bcebos.com/PP-OCRv2/chinese/ch_PP-OCRv2_rec_infer.tar -O ch_PP-OCRv2_rec_infer.tar && tar -xf ch_PP-OCRv2_rec_infer.tar

# Run text recognition on the image
!python tools/infer/predict_rec.py --image_dir="./doc/imgs_words/ch/word_4.jpg" --rec_model_dir="./inference/ch_PP-OCRv2_rec_infer" --use_gpu=False


--2021-12-24 21:50:31--  https://paddleocr.bj.bcebos.com/PP-OCRv2/chinese/ch_PP-OCRv2_rec_infer.tar
Resolving paddleocr.bj.bcebos.com (paddleocr.bj.bcebos.com)... 182.61.200.195, 182.61.200.229, 2409:8c04:1001:1002:0:ff:b001:368a
Connecting to paddleocr.bj.bcebos.com (paddleocr.bj.bcebos.com)|182.61.200.195|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 8875520 (8.5M) [application/x-tar]
Saving to: ‘ch_PP-OCRv2_rec_infer.tar’

ch_PP-OCRv2_rec_inf 100%[===================>]   8.46M  24.2MB/s    in 0.4s
2021-12-24 21:50:31 (24.2 MB/s) - ‘ch_PP-OCRv2_rec_infer.tar’ saved [8875520/8875520]

[2021/12/24 21:50:33] root INFO: Predicts of ./doc/imgs_words/ch/word_4.jpg: ('…', …)


import cv2
import matplotlib.pyplot as plt

# Rotate ./doc/imgs_words/ch/word_1.jpg 180 degrees to get
img = cv2.imread("./doc/imgs_words/ch/word_1.jpg")
plt.imshow(img[:, :, ::-1])  # BGR -> RGB for display
plt.show()

!python tools/infer/predict_rec.py --image_dir="././doc/imgs_words/ch/word_1.jpg" --rec_model_dir="./inference/ch_PP-OCRv2_rec_infer" --use_gpu=False


[Output image: word_1.jpg]

[2021/12/24 21:52:00] root INFO: Predicts of ././doc/imgs_words/ch/word_1.jpg: ('…', 0.9967349)


6.3 Interpretation of PP-OCRv2 Optimization Strategies 


Section 6.2 introduced PP-OCR and its 19 optimization strategies in detail.


Compared with PP-OCR, PP-OCRv2 further optimizes three aspects, namely the backbone network, data augmentation,
and the loss function, to address poor on-device prediction efficiency, complex backgrounds, and misrecognition
of similar characters. It also introduces a knowledge distillation training strategy to further improve the accuracy
of the model. Specifically:


• Detection model optimization: (1) CML knowledge distillation strategy; (2) CopyPaste data augmentation strategy;

• Recognition model optimization: (1) PP-LCNet; (2) U-DML knowledge distillation strategy; (3) Enhanced CTC
loss.


This section mainly interprets the optimization strategies of PP-OCRv2 based on the optimization process of text detection 
and recognition models. 
6.3.1 Detailed Explanation of Text Detection Model Optimization


In the process of optimizing the text detection model, the CML knowledge distillation and CopyPaste data augmentation
strategies are adopted. The Hmean of the text detection model increases from 0.759 to 0.795 while the model size
remains unchanged. The ablation experiment is shown below.


Strategy                      | Precision | Recall | Hmean | Model Size (M) | Inference Time (CPU, ms)
PP-OCR det                    | 0.718     | 0.805  | 0.759 | 3.0            | 129
PP-OCR det + DML              | 0.743     | 0.815  | 0.777 | 3.0            | 129
PP-OCR det + CML              | 0.746     | 0.835  | 0.789 | 3.0            | 129
PP-OCR det + CML + CopyPaste  | 0.754     | 0.840  | 0.795 | 3.0            | 129


Table 2: Ablation study of CML and CopyPaste for text detection. 


Figure 28: Ablation experiment of the PP-OCRv2 detection model 


CML Knowledge Distillation Strategy 


The knowledge distillation method is commonly used in deployment. By using a large model to guide the learning of 
a small model, the accuracy of the small model can usually be further improved while the prediction time is the same, 
thereby further improving the actual deployment experience. 


The standard distillation method is to use a large model as the teacher model to guide the student model to improve the 
effect. Later on, the DML mutual learning distillation method has been developed, which is to learn from each other 
through two models with the same structure. Compared with the former, DML does not depend on the large teacher 
model, making the distillation training process simpler, and improving the model output efficiency. 


The PP-OCRv2 text detection model uses the CML (Collaborative Mutual Learning) distillation method among
three models: it involves mutual learning between two student models of the same structure and also introduces a
teacher model with a larger structure. The comparison between CML and other distillation algorithms is shown below.


[Diagram panels: Standard Distillation; DML: learning from each other through two models with the same structure;
CML: involving mutual learning between two student models and a larger model structure]


Figure 29: Comparison between CML and other knowledge distillation algorithms

In the text detection task, the structure diagram of CML is as follows. The response maps refer to the probability map
output of the last layer of DBNet. The whole training process involves three loss functions:

• GT loss
• DML loss
• Distill loss

The backbone network of the teacher model here is ResNet18_vd, and that of the two student models is MobileNetV3.


[Diagram labels: response maps, Distill Loss, GT Label]

Figure 30: CML structure diagram


• GT loss


Most of the parameters in the two student models are initialized from scratch, so they need to be supervised by
ground-truth (GT) information during training. The pipeline of the DBNet training task is shown below. The output
mainly contains three kinds of feature maps.


[Diagram: the DBNet head outputs a probability map, a threshold map, and an approximate binary map]


Figure 31: DBNet head structure 


Different loss functions are used to supervise these feature maps, as shown in the following table.


Feature map     | Loss function             | Weight
Probability map | Binary cross-entropy loss | 1.0
Binary map      | Dice loss                 | α
Threshold map   | L1 loss                   | β

The final GT loss can be expressed as below.

$$Loss_{gt}(T_{out}, gt) = l_p(S_{out}, gt) + \alpha\, l_b(S_{out}, gt) + \beta\, l_t(S_{out}, gt)$$


• DML loss


The two student models have the same structure, so they should produce the same output for the same input. The final
output of DBNet is the probability map (response map). Based on the KL divergence, the DML loss of the two student
models is calculated as follows.

$$Loss_{dml} = \frac{KL(S1_{out} \,\|\, S2_{out}) + KL(S2_{out} \,\|\, S1_{out})}{2}$$

Here, KL(·‖·) denotes the KL divergence, and the DML loss calculated this way is symmetric.
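The symmetry of the DML loss is easy to check numerically. The sketch below treats each student output as a small discrete probability distribution, which is a simplification of the per-pixel probability maps used in practice. The function names and the `eps` smoothing constant are assumptions for this example.

```python
import math


def kl_divergence(p, q, eps=1e-10):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def dml_loss(s1_out, s2_out):
    """Average of the two KL directions, so swapping the students changes nothing."""
    return 0.5 * (kl_divergence(s1_out, s2_out) + kl_divergence(s2_out, s1_out))


student1 = [0.7, 0.2, 0.1]
student2 = [0.5, 0.3, 0.2]
print(dml_loss(student1, student2), dml_loss(student2, student1))
```

A single KL term is not symmetric, so averaging both directions ensures that neither student is treated as the reference for the other.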


• Distill loss


In CML, the teacher model is introduced to supervise the two student models at the same time. In PP-OCRv2, only the
probability map is supervised by distillation. Specifically, for one of the student models, the loss is calculated
as follows, where l_p(·) and l_b(·) denote the binary cross-entropy loss and the dice loss respectively. The loss
calculation for the other student model is exactly the same.


$$Loss_{distill} = \gamma\, l_p(S_{out}, f_{dila}(T_{out})) + l_b(S_{out}, f_{dila}(T_{out}))$$


Finally, by adding the above three losses, we can get the loss function used for CML training. 
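A minimal sketch of how the distillation term and the final sum might be computed is given below. It works on flattened probability maps stored as lists and assumes the teacher map has already been preprocessed. The function names, the default weight `gamma=5.0`, and the smoothing constants are illustrative assumptions, not the PaddleOCR implementation.

```python
import math


def bce_loss(pred, target, eps=1e-7):
    """Mean binary cross-entropy between two flattened maps."""
    total = 0.0
    for p, t in zip(pred, target):
        p = min(max(p, eps), 1 - eps)  # keep log() finite
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(pred)


def dice_loss(pred, target, eps=1e-6):
    """1 - Dice coefficient; small when the two maps overlap well."""
    intersection = sum(p * t for p, t in zip(pred, target))
    return 1 - 2 * intersection / (sum(pred) + sum(target) + eps)


def distill_loss(student_map, teacher_map, gamma=5.0):
    """Weighted BCE plus dice between student and teacher maps (gamma is hypothetical)."""
    return gamma * bce_loss(student_map, teacher_map) + dice_loss(student_map, teacher_map)


def cml_total_loss(gt_loss, dml_loss, distill):
    """Plain sum of the three losses."""
    return gt_loss + dml_loss + distill


teacher = [1.0, 0.0, 1.0, 0.0]
good = distill_loss([1.0, 0.0, 1.0, 0.0], teacher)
poor = distill_loss([0.5, 0.5, 0.5, 0.5], teacher)
print(good, poor, cml_total_loss(0.4, 0.1, poor))
```

The cross-entropy term penalizes each pixel independently, while the dice term measures region overlap, so the two complement each other when text pixels are sparse.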


The detection configuration file is ch_PP-OCRv2_det_cml.yml, distillation structure part The configuration and part of 
the explanation are as follows. 


Architecture:
  name: DistillationModel  # Model name; this is the general distillation model representation
  algorithm: Distillation  # Algorithm name
  Models:  # Models, including the configuration information of the sub-networks
    Teacher:  # Teacher sub-network, including the `pretrained` and `freeze_params` information and the other parameters used to construct the sub-network
      freeze_params: true  # Whether to fix the parameters of the Teacher network
      pretrained: ./pretrain_models/ch_ppocr_server_v2.0_det_train/best_accuracy  # Pretrained model
      return_all_feats: false  # Whether to return all features; when True, the outputs of the backbone, neck, head and other modules are returned
      model_type: det  # Model category
      algorithm: DB  # Teacher network algorithm name
      Transform:
      Backbone:
        name: ResNet
        layers: 18
      Neck:
        name: DBFPN
        out_channels: 256
      Head:
        name: DBHead
        k: 50
    Student:  # Student sub-network
      freeze_params: false
      pretrained: ./pretrain_models/MobileNetV3_large_x0_5_pretrained
      return_all_feats: false
      model_type: det
      algorithm: DB
      Backbone:
        name: MobileNetV3
        scale: 0.5
        model_name: large
        disable_se: True
      Neck:
        name: DBFPN
        out_channels: 96
      Head:
        name: DBHead
        k: 50
    Student2:  # Student2 sub-network
      freeze_params: false
      pretrained: ./pretrain_models/MobileNetV3_large_x0_5_pretrained
      return_all_feats: false
      model_type: det
      algorithm: DB
      Transform:
      Backbone:
        name: MobileNetV3
        scale: 0.5
        model_name: large
        disable_se: True
      Neck:
        name: DBFPN
        out_channels: 96
      Head:
        name: DBHead
        k: 50


The implementation of DistillationModel is in the distillation_model.py file. The implementation and part of the
explanation of DistillationModel are as follows.


class DistillationModel(nn.Layer):
    def __init__(self, config):
        """
        the module for OCR distillation.
        args:
            config (dict): the super parameters for module.
        """
        super().__init__()
        self.model_list = []
        self.model_name_list = []
        # According to each field in Models, extract the name of the sub-network
        # and the corresponding configuration
        for key in config["Models"]:
            model_config = config["Models"][key]
            freeze_params = False
            pretrained = None
            if "freeze_params" in model_config:
                freeze_params = model_config.pop("freeze_params")
            if "pretrained" in model_config:
                pretrained = model_config.pop("pretrained")
            # According to the configuration of each sub-network, generate
            # sub-networks based on BaseModel
            model = BaseModel(model_config)
            # Determine whether to load the pre-trained model
            if pretrained is not None:
                load_pretrained_params(model, pretrained)
            # Determine whether it is necessary to fix the parameters of the sub-network
            if freeze_params:
                for param in model.parameters():
                    param.trainable = False
            self.model_list.append(self.add_sublayer(key, model))
            self.model_name_list.append(key)

    def forward(self, x):
        result_dict = dict()
        for idx, model_name in enumerate(self.model_name_list):
            result_dict[model_name] = self.model_list[idx](x)
        return result_dict
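
The same "one sub-network per config key, one output per name" pattern can be tried without Paddle. The sketch below is a framework-free analogy under stated assumptions: `ToySubNet` and `ToyDistillationModel` are hypothetical stand-ins, and the `scale` parameter only exists so that each sub-network produces a distinguishable output.

```python
class ToySubNet:
    """Hypothetical stand-in for a sub-network: multiplies its input by `scale`."""

    def __init__(self, scale):
        self.scale = scale
        self.trainable = True

    def __call__(self, x):
        return [self.scale * v for v in x]


class ToyDistillationModel:
    def __init__(self, config):
        self.model_list = []
        self.model_name_list = []
        for key, model_config in config["Models"].items():
            # Work on a copy so that pop() does not modify the caller's config.
            model_config = dict(model_config)
            freeze_params = model_config.pop("freeze_params", False)
            model = ToySubNet(**model_config)
            if freeze_params:
                model.trainable = False
            self.model_list.append(model)
            self.model_name_list.append(key)

    def forward(self, x):
        # One entry per sub-network, keyed by the name used in the config.
        return {name: model(x)
                for name, model in zip(self.model_name_list, self.model_list)}


config = {
    "Models": {
        "Teacher": {"freeze_params": True, "scale": 2.0},
        "Student": {"freeze_params": False, "scale": 0.5},
    }
}
toy = ToyDistillationModel(config)
outputs = toy.forward([1.0, 2.0])
```

Here the frozen flag is only recorded on the object. In the Paddle version the equivalent step sets `param.trainable = False` on every parameter of the sub-network.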


Use the following code to quickly build the distillation model.


# Reference Code
# https://github.com/PaddlePaddle/PaddleOCR/blob/release/2.4/ppocr/modeling/architectures/__init__.py
from tools.program import load_config
from ppocr.modeling.architectures import build_model

config_path = "./configs/det/ch_PP-OCRv2/ch_PP-OCRv2_det_cml.yml"
config = load_config(config_path)
model = build_model(config['Architecture'])
print(model)


You can quickly experience the training process of CML distillation in the following way.


# Reference Code
# https://github.com/PaddlePaddle/PaddleOCR/blob/release/2.4/tools/train.py
import os

os.chdir("/home/aistudio/PaddleOCR/")

# Download the demo dataset and the pre-trained teacher model
!wget https://paddleocr.bj.bcebos.com/dataset/det_data_lesson_demo.tar -O det_data_lesson_demo.tar && tar -xf det_data_lesson_demo.tar && rm det_data_lesson_demo.tar
!mkdir pretrain_models && wget https://paddleocr.bj.bcebos.com/dygraph_v2.0/ch/ch_ppocr_server_v2.0_det_train.tar && tar -xf ch_ppocr_server_v2.0_det_train.tar
!mv ch_ppocr_server_v2.0_det_train pretrain_models/ && rm ch_ppocr_server_v2.0_det_train.tar

# Training script
# Note: Only one epoch is trained here for a quick demonstration, so the metrics will be very poor
!python tools/train.py -c configs/det/ch_PP-OCRv2/ch_PP-OCRv2_det_cml.yml \
    -o Global.pretrained_model="" \
    Train.dataset.data_dir="./det_data_lesson_demo/" \
    Train.dataset.label_file_list=["./det_data_lesson_demo/train.txt"] \
    Train.loader.num_workers=0 \
    Eval.dataset.data_dir="./det_data_lesson_demo/" \
    Eval.dataset.label_file_list=["./det_data_lesson_demo/eval.txt"] \
    Eval.loader.num_workers=0 \
    Optimizer.lr.learning_rate=0.00025 \
    Global.epoch_num=1
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Data Augmentation 


Data augmentation is one of the important means of improving the generalization ability of a model. CopyPaste is a novel
data augmentation technique that has been validated in object detection and instance segmentation. With CopyPaste, text
instances can be synthesized to balance the ratio of positive to negative samples in the training images, which
traditional image rotation, random flipping and random cropping cannot achieve.


The steps of CopyPaste include:

1. Randomly selecting two training images;
2. Random scale jittering;
3. Flipping horizontally;
4. Randomly selecting a target subset in one image;
5. Pasting it at a random location in the other image.


In this way, the samples are more diverse and the model is more robust. As shown in the figure below, the text cropped
from the image in the lower left corner is randomly rotated and scaled and then pasted into the upper left image, so
the background of the text becomes more diverse.
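
The core "crop, flip, paste at a random location" step can be sketched with NumPy arrays. This is a minimal illustration under assumptions, not the code in copy_paste.py: `paste_instance` is a hypothetical helper, boxes are axis-aligned `(x0, y0, x1, y1)` rectangles, and scale jittering, rotation and overlap checks against existing text boxes are left out.

```python
import numpy as np


def paste_instance(dst_img, src_img, src_box, rng):
    """Paste the region src_box of src_img at a random location in dst_img."""
    x0, y0, x1, y1 = src_box
    patch = src_img[y0:y1, x0:x1]
    if rng.random() < 0.5:
        patch = patch[:, ::-1]  # random horizontal flip of the cropped text
    ph, pw = patch.shape[:2]
    dh, dw = dst_img.shape[:2]
    if ph > dh or pw > dw:
        return dst_img, None  # the patch does not fit, so nothing is pasted
    top = int(rng.integers(0, dh - ph + 1))
    left = int(rng.integers(0, dw - pw + 1))
    out = dst_img.copy()
    out[top:top + ph, left:left + pw] = patch
    # The pasted region becomes a new label box for the destination image.
    return out, (left, top, left + pw, top + ph)


rng = np.random.default_rng(0)
src = np.full((40, 60, 3), 255, dtype=np.uint8)   # white "text" image
dst = np.zeros((100, 120, 3), dtype=np.uint8)     # black destination image
pasted, box = paste_instance(dst, src, (10, 5, 40, 25), rng)
```

The returned box plays the role of the extra label boxes that CopyPaste appends to the original annotations.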


Figure 32: A CopyPaste result (panels: original image, cropped image)


If you want to use CopyPaste in model training, add CopyPaste in the Train.transforms configuration field. 


Train:
  dataset:
    name: SimpleDataSet
    data_dir: ./train_data/icdar2015/text_localization/
    label_file_list:
      - ./train_data/icdar2015/text_localization/train_icdar2015_label.txt
    ratio_list: [1.0]
    transforms:
      - DecodeImage: # load image
          img_mode: BGR
          channel_first: False
      - DetLabelEncode: # Class handling label
      - CopyPaste: # Add CopyPaste
      - IaaAugment:
          augmenter_args:
            - { 'type': Fliplr, 'args': { 'p': 0.5 } }
            - { 'type': Affine, 'args': { 'rotate': [-10, 10] } }
            - { 'type': Resize, 'args': { 'size': [0.5, 3] } }
      - EastRandomCropData:
          size: [960, 960]
          max_tries: 50
          keep_ratio: true
      - MakeBorderMap:
          shrink_ratio: 0.4
          thresh_min: 0.3
          thresh_max: 0.7
      - MakeShrinkMap:
          shrink_ratio: 0.4
          min_text_size: 8
      - NormalizeImage:
          scale: 1./255.
          mean: [0.485, 0.456, 0.406]
          std: [0.229, 0.224, 0.225]
          order: 'hwc'
      - ToCHWImage:
      - KeepKeys:
          keep_keys: ['image', 'threshold_map', 'threshold_mask', 'shrink_map', 'shrink_mask'] # the order of the dataloader list
  loader:
    shuffle: True
    drop_last: False
    batch_size_per_card: 8
    num_workers: 4


For the implementation of CopyPaste, refer to copy_paste.py.


In the following, CopyPaste will be demonstrated based on the icdar2015 detection dataset. 


import os
import sys

os.chdir("/home/aistudio/PaddleOCR/")
!unzip -oq /home/aistudio/data/data46088/icdar2015.zip

# Reference Code:
# https://github.com/PaddlePaddle/PaddleOCR/blob/release/2.4/ppocr/data/simple_dataset.py
import logging
import random
import numpy as np

from ppocr.data.imaug import create_operators, transform

logging.basicConfig()
logger = logging.getLogger(__name__)


# CopyPaste example
class CopyPasteDemo(object):
    def __init__(self):
        self.data_dir = "./icdar2015/text_localization/"
        self.label_file_list = "./icdar2015/text_localization/train_icdar2015_label.txt"
        self.data_lines = self.get_image_info_list(self.label_file_list)
        self.data_idx_order_list = list(range(len(self.data_lines)))
        transforms = [
            {"DecodeImage": {"img_mode": "BGR", "channel_first": False}},
            {"DetLabelEncode": {}},
            {"CopyPaste": {"objects_paste_ratio": 1.0}},
        ]
        self.ops = create_operators(transforms)

    # Select an image and copy the content to the current image
    def get_ext_data(self, idx):
        ext_data_num = 1
        ext_data = []
        # Only decode the image and encode the label for the extra image
        load_data_ops = self.ops[:2]
        next_idx = idx
        while len(ext_data) < ext_data_num:
            next_idx = (next_idx + 1) % len(self)
            file_idx = self.data_idx_order_list[next_idx]
            data_line = self.data_lines[file_idx]
            data_line = data_line.decode('utf-8')
            substr = data_line.strip("\n").split("\t")
            file_name = substr[0]
            label = substr[1]
            img_path = os.path.join(self.data_dir, file_name)
            data = {'img_path': img_path, 'label': label}
            if not os.path.exists(img_path):
                continue
            with open(data['img_path'], 'rb') as f:
                img = f.read()
                data['image'] = img
            data = transform(data, load_data_ops)
            if data is None:
                continue
            ext_data.append(data)
        return ext_data

    # Get image information
    def get_image_info_list(self, file_list):
        if isinstance(file_list, str):
            file_list = [file_list]
        data_lines = []
        for idx, file in enumerate(file_list):
            with open(file, "rb") as f:
                lines = f.readlines()
                data_lines.extend(lines)
        return data_lines

    # Get a piece of data in the DataSet
    def __getitem__(self, idx):
        file_idx = self.data_idx_order_list[idx]
        data_line = self.data_lines[file_idx]
        try:
            data_line = data_line.decode('utf-8')
            substr = data_line.strip("\n").split("\t")
            file_name = substr[0]
            label = substr[1]
            img_path = os.path.join(self.data_dir, file_name)
            data = {'img_path': img_path, 'label': label}
            if not os.path.exists(img_path):
                raise Exception("{} does not exist!".format(img_path))
            with open(data['img_path'], 'rb') as f:
                img = f.read()
                data['image'] = img
            data['ext_data'] = self.get_ext_data(idx)
            outs = transform(data, self.ops)
        except Exception as e:
            print(
                "When parsing line {}, error happened with msg: {}".format(
                    data_line, e))
            outs = None
        if outs is None:
            return
        return outs

    def __len__(self):
        return len(self.data_idx_order_list)


copy_paste_demo = CopyPasteDemo()

idx = 1
data1 = copy_paste_demo[idx]
print(data1.keys())
print(data1["img_path"])
print(data1["ext_data"][0]["img_path"])


dict_keys(['img_path', 'label', 'image', 'ext_data', 'polys', 'texts', 'ignore_tags'])
./icdar2015/text_localization/icdar_c4_train_imgs/img_603.jpg
./icdar2015/text_localization/icdar_c4_train_imgs/img_233.jpg


• The following two pictures are the images before CopyPaste is applied.


import cv2
import matplotlib.pyplot as plt
%matplotlib inline

img1 = cv2.imread(data1["img_path"])
img2 = cv2.imread(data1["ext_data"][0]["img_path"])
plt.figure(figsize=(10, 6))
# OpenCV loads images as BGR, so reverse the channel axis for matplotlib
plt.imshow(img1[:, :, ::-1])
plt.show()
plt.figure(figsize=(10, 6))
plt.imshow(img2[:, :, ::-1])
plt.show()


• Draw the updated label detection boxes, where the red boxes are the original labelled information and the blue boxes
are the label boxes added by CopyPaste.


import json
infos = copy_paste_demo.data_lines[idx]
infos = json.loads(infos.decode('utf-8').split("\t")[1])

img3 = data1["image"].copy()
plt.figure(figsize=(15, 10))
plt.imshow(img3[:, :, ::-1])
# Original labelled information
for info in infos:
    xs, ys = zip(*info["points"])
    xs = list(xs)
    ys = list(ys)
    # Close the polygon by repeating the first point
    xs.append(xs[0])
    ys.append(ys[0])
    plt.plot(xs, ys, "r")
# Added labelled information
for poly_idx in range(len(infos), len(data1["polys"])):
    poly = data1["polys"][poly_idx]
    xs, ys = zip(*poly)
    xs = list(xs)
    ys = list(ys)
    xs.append(xs[0])
    ys.append(ys[0])
    plt.plot(xs, ys, "b")
plt.show()


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Summary of Text Detection Optimization 


In PP-OCRv2, the knowledge distillation scheme and the data augmentation strategy are adopted for the text detection
model to improve its generalization performance. With the detection model size remaining the same, Hmean increases
from 0.759 to 0.795. The ablation experiment is displayed below.


Strategy | Precision | Recall | Hmean | Model Size (M) | Inference Time (CPU, ms)
PP-OCR det | 0.718 | 0.805 | 0.759 | 3.0 | 129
PP-OCR det + DML | 0.743 | 0.815 | 0.777 | 3.0 | 129
PP-OCR det + CML | 0.746 | 0.835 | 0.789 | 3.0 | 129
PP-OCR det + CML + CopyPaste | 0.754 | 0.840 | 0.795 | 3.0 | 129


Table 2: Ablation study of CML and CopyPaste for text detection.


Figure 33: PP-OCRv2 detection model ablation experiment


The detection result of PP-OCRv2 is shown below. 


Figure 34: The detection result of PP-OCRv2 
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6.3.2 Detailed Explanation of Text Recognition Model Optimization 


The PP-OCRv2 text recognition model is optimized through backbone network optimization, the U-DML knowledge
distillation strategy, and CTC loss improvement. As a result, the recognition accuracy increases from 66.7% to 74.8%.
Its ablation experiments are as follows.


Strategy | Model Size | Inference Time (CPU, ms)
PP-OCR rec (MV3) | | 7.7
PP-OCR rec (LCNet) | | 6.2
PP-OCR rec (LCNet) + U-DML | | 6.2
PP-OCR rec (LCNet) + U-DML + Enhanced CTC loss | | 6.2


Table 3: Ablation study of LCNet, U-DML, and Enhanced CTC loss for text recognition.


Figure 35: Ablation experiment of PP-OCRv2 recognition model


PP-LCNet Lightweight Backbone Network 


Baidu has proposed PP-LCNet, a lightweight CPU network based on the MKLDNN acceleration strategy, which greatly
improves the performance of lightweight models in image classification. It also performs well in downstream computer
vision tasks such as text recognition, object detection, and semantic segmentation. Note that PP-LCNet is customized
for the CPU+MKLDNN scenario, where it is far better than other models in both speed and accuracy on the classification
task. So if you need a model for this scenario, PP-LCNet is recommended.


PP-LCNet paper: PP-LCNet: A Lightweight CPU Convolutional Neural Network


PP-LCNet is improved from MobileNetV1, and its structure is shown below.


Output shape | Operator | Repeat
16x112x112 | Stem Conv / h-swish | x1
64x56x56 | 3x3 DepthSepConv | x2
128x28x28 | 3x3 DepthSepConv | x2
256x14x14 | 3x3 DepthSepConv | x2
512x7x7 | 5x5 DepthSepConv | x7
1280 | FC / h-swish |

Legend: Stem Conv is the stem of the network, a standard convolutional layer; 3x3 DepthSepConv is a 3x3 dw and 1x1 pw
layer; 5x5 DepthSepConv is a 5x5 dw and 1x1 pw layer; GAP is the global average pooling layer; FC is the linear layer;
SE is the SE module.


Figure 36: PP-LCNet structure




Compared with MobileNetV1, PP-LCNet integrates the activation function, head structure, SE modules and other
optimization techniques from the MobileNetV3 structure. It also revisits the convolution kernel size of the convolution
layers in the final stage. As a result, the model keeps its speed while outperforming lightweight models such as
MobileNet and GhostNet in accuracy.


PP-LCNet is optimized in four aspects:


• Except for the SE modules, all ReLU activation functions in the network are replaced with h-swish, and the accuracy
is improved by 1%-2%;


• In the fifth stage of PP-LCNet, the kernel size of the DW convolution is changed to 5x5, and the accuracy is improved
by 0.5%-1%;


• SE modules are added to the last two DepthSepConv blocks of the fifth stage of PP-LCNet, and the accuracy is
increased by 0.5%-1%;


• A 1280-dimensional FC layer is added after the GAP layer to increase the feature expression capability, and the
accuracy is increased by 2%-3%.
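
For reference, h-swish is commonly defined as x * ReLU6(x + 3) / 6. The NumPy sketch below evaluates that definition on a few points. It is only an illustration of the activation itself, with hypothetical helper names `relu6` and `h_swish`, and is not taken from the PP-LCNet code.

```python
import numpy as np


def relu6(x):
    # ReLU clipped at 6.
    return np.minimum(np.maximum(x, 0.0), 6.0)


def h_swish(x):
    # Piecewise-linear approximation of swish: x * ReLU6(x + 3) / 6.
    return x * relu6(x + 3.0) / 6.0


values = h_swish(np.array([-4.0, -3.0, 0.0, 3.0, 6.0]))
```

The function is zero for inputs at or below -3 and equals the identity for inputs at or above 3, so it needs only cheap element-wise operations.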


On the ImageNet1k dataset, PP-LCNet is compared with other common lightweight classification models in terms of Top-1
accuracy and inference time, as shown in the figure below. PP-LCNet excels in both indicators.


Model | Inference Time (ms)
MobileNetV1-0.75x | 3.88
MobileNetV2-0.75x | 4.56
MobileNetV3-small-1.0x | 4.20
MobileNetV3-large-0.5x | 4.54
GhostNet-0.5x | 6.63
PP-LCNet-1.0x | 3.16


Figure 37: Comparison of state-of-the-art light networks over classification accuracy


The PP-LCNet recognition model can be rapidly defined as follows.


# Reference Code
# https://github.com/PaddlePaddle/PaddleOCR/blob/release/2.4/ppocr/modeling/backbones/rec_mv1_enhance.py
import paddle
from ppocr.modeling.backbones.rec_mv1_enhance import MobileNetV1Enhance

x = paddle.rand([1, 3, 23, 320])
model = MobileNetV1Enhance(scale=0.5)
y = model(x)
print(y.shape)
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U-DML Knowledge Distillation Strategy 


In the standard DML strategy, the distillation loss only supervises the output layer. However, for two models with
identical structure, the intermediate features and the outputs are expected to be the same for the same input. So, in
addition to the supervision of the output layer, a supervision signal on the intermediate output feature map can be
added to the loss function. This is the U-DML (Unified-Deep Mutual Learning) knowledge distillation method in PP-OCRv2.


The flow chart of U-DML knowledge distillation is shown below. The teacher network and the student network have exactly
the same structure but different initialization parameters. In addition, on top of standard DML knowledge distillation,
a new supervision mechanism for feature maps, the feature loss, is introduced.


Figure 38: The flow chart of U-DML knowledge distillation


In the training process, there are three types of loss: GT loss, DML loss, and feature loss.

• GT loss


The model structure in the text recognition task is CRNN, so CTC loss is used as the GT loss. The GT loss is calculated
as follows.


$$Loss_{ctc} = CTC(S_{hout}, gt) + CTC(T_{hout}, gt)$$


• DML loss


The DML loss is calculated as follows. The teacher model and the student model each compute the KL divergence with
respect to the other, so the DML loss is symmetric.


$$Loss_{dml} = \frac{KL(S_{pout} \| T_{pout}) + KL(T_{pout} \| S_{pout})}{2}$$


• Feature loss


The feature loss uses the L2 loss and is calculated as follows.

$$Loss_{feat} = L2(S_{bout}, T_{bout})$$

Finally, the loss function used during training is as below.

$$Loss_{total} = Loss_{ctc} + Loss_{dml} + Loss_{feat}$$
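
The way the three terms add up can be sketched in NumPy. This is an illustration under assumptions rather than the PaddleOCR loss code: the function names are hypothetical, the two CTC values are passed in as precomputed numbers, the head outputs are logits of shape (time steps, classes), and the L2 term is taken as a mean squared difference.

```python
import numpy as np


def softmax(logits):
    # Softmax over the last axis (the character classes).
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def symmetric_kl(p, q, eps=1e-10):
    # Mean over time steps of the two directed KL divergences, averaged.
    p = np.clip(p, eps, 1.0)
    q = np.clip(q, eps, 1.0)
    kl_pq = np.sum(p * np.log(p / q), axis=-1).mean()
    kl_qp = np.sum(q * np.log(q / p), axis=-1).mean()
    return float(0.5 * (kl_pq + kl_qp))


def feature_l2(student_feat, teacher_feat):
    # Mean squared difference between the two backbone feature maps.
    return float(np.mean((student_feat - teacher_feat) ** 2))


def udml_total_loss(ctc_student, ctc_teacher, s_head, t_head, s_backbone, t_backbone):
    loss_ctc = ctc_student + ctc_teacher
    loss_dml = symmetric_kl(softmax(s_head), softmax(t_head))
    loss_feat = feature_l2(s_backbone, t_backbone)
    return loss_ctc + loss_dml + loss_feat


rng = np.random.default_rng(0)
head = rng.normal(size=(5, 10))
feat = rng.normal(size=(8, 16))
same = udml_total_loss(1.5, 1.2, head, head, feat, feat)
different = udml_total_loss(1.5, 1.2, head, head + rng.normal(size=head.shape),
                            feat, feat + 1.0)
```

When the two networks agree on both the head output and the backbone features, only the CTC terms remain. Any disagreement adds a non-negative DML or feature penalty.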


In addition, the model is further improved during training by increasing the number of iterations and by adding tricks
such as an FC layer in the head, which balances the feature encoding and decoding capabilities of the model.


The configuration file is ch_PP-OCRv2_rec_distillation.yml.

Architecture:
  model_type: &model_type "rec"  # Model category (rec, det, etc.); shared by every sub-network
  name: DistillationModel  # Structure name; DistillationModel is used in the distillation task to construct the corresponding structure
  algorithm: Distillation  # Algorithm name
  Models:  # Models, including the configuration information of the sub-networks
    Teacher:  # Name of the sub-network. It must contain `pretrained` and `freeze_params`; the other parameters are the construction parameters of the sub-network
      pretrained:  # Whether the sub-network needs to load a pre-trained model
      freeze_params: false  # Whether to fix the parameters
      return_all_feats: true  # Sub-network parameter indicating whether all features are returned. If False, only the final output is returned
      model_type: *model_type  # Model category
      algorithm: CRNN  # Algorithm name of the sub-network. The remaining parameters are construction parameters, consistent with the general model training configuration
      Transform:
      Backbone:
        name: MobileNetV1Enhance
        scale: 0.5
      Neck:
        name: SequenceEncoder
        encoder_type: rnn
        hidden_size: 64
      Head:
        name: CTCHead
        mid_channels: 96  # Insert an intermediate layer in the head decoding process
        fc_decay: 0.00002
    Student:  # Another sub-network. This is a DML distillation example: the two sub-networks share the same structure, and both need to learn parameters
      pretrained:  # The following construction parameters are the same as above
      freeze_params: false
      return_all_feats: true
      model_type: *model_type
      algorithm: CRNN
      Transform:
      Backbone:
        name: MobileNetV1Enhance
        scale: 0.5
      Neck:
        name: SequenceEncoder
        encoder_type: rnn
        hidden_size: 64
      Head:
        name: CTCHead
        mid_channels: 96
        fc_decay: 0.00002


If you want to add more sub-networks for training, you can add the corresponding fields to the configuration file in
the same way as Student and Teacher. For example, if you want three models to supervise each other and train together,
Architecture can be written in the following way.


Architecture:
  model_type: &model_type "rec"
  name: DistillationModel
  algorithm: Distillation
  Models:
    Teacher:
      pretrained:
      freeze_params: false
      return_all_feats: true
      model_type: *model_type
      algorithm: CRNN
      Transform:
      Backbone:
        name: MobileNetV1Enhance
        scale: 0.5
      Neck:
        name: SequenceEncoder
        encoder_type: rnn
        hidden_size: 64
      Head:
        name: CTCHead
        mid_channels: 96
        fc_decay: 0.00002
    Student:
      pretrained:
      freeze_params: false
      return_all_feats: true
      model_type: *model_type
      algorithm: CRNN
      Transform:
      Backbone:
        name: MobileNetV1Enhance
        scale: 0.5
      Neck:
        name: SequenceEncoder
        encoder_type: rnn
        hidden_size: 64
      Head:
        name: CTCHead
        mid_channels: 96
        fc_decay: 0.00002
    Student2:  # A new sub-network introduced in the knowledge distillation task; the other parts remain the same as the above configuration
      pretrained:
      freeze_params: false
      return_all_feats: true
      model_type: *model_type
      algorithm: CRNN
      Transform:
      Backbone:
        name: MobileNetV1Enhance
        scale: 0.5
      Neck:
        name: SequenceEncoder
        encoder_type: rnn
        hidden_size: 64
      Head:
        name: CTCHead
        mid_channels: 96
        fc_decay: 0.00002


When the model is trained, it contains 3 sub-networks: Teacher, Student, and Student2.
For the code of the DistillationModel class, refer to distillation_model.py.


The final output of the model's forward pass is a dictionary whose keys are the names of all the sub-networks (here
Student and Teacher) and whose values are the outputs of the corresponding sub-networks. A value can be a Tensor (only
the output of the last layer is returned) or a dict (the intermediate feature information is also returned).


In the recognition task, in order to add more loss functions and ensure the scalability of the distillation method, the
output of each sub-network is saved as a dict that contains the output of each sub-module. Taking this recognition
model as an example, the output of each sub-network is a dict whose keys include backbone_out, neck_out and head_out,
and whose values are the tensors of the corresponding modules. Finally, based on the above configuration file, the
output of DistillationModel is:


{
  "Teacher": {
    "backbone_out": tensor,
    "neck_out": tensor,
    "head_out": tensor,
  },
  "Student": {
    "backbone_out": tensor,
    "neck_out": tensor,
    "head_out": tensor,
  }
}

In the knowledge distillation task, the loss function configuration is as follows. 


Loss:
  name: CombinedLoss  # Loss function name; the loss class is built from this name
  loss_config_list:  # List of loss function configurations, required by CombinedLoss
  - DistillationCTCLoss:  # CTC loss function for distillation, inherited from the standard CTC loss
      weight: 1.0  # Weight of the loss function. Every loss configuration in loss_config_list must include this field
      model_name_list: ["Student", "Teacher"]  # From the predictions of the distillation model, extract the outputs of these two sub-networks and calculate the CTC loss with gt
      key: head_out  # Take the tensor corresponding to this key in the sub-network output dict
  - DistillationDMLLoss:  # DML loss function for distillation, inherited from the standard DMLLoss
      weight: 1.0  # Weight
      act: "softmax"  # Activation function applied to the input; it can be softmax, sigmoid or None, and the default is None
      model_name_pairs:  # Sub-network name pairs used to calculate the DML loss. To calculate the DML loss of other sub-networks, add more pairs to the list below
      - ["Student", "Teacher"]
      key: head_out  # Take the tensor corresponding to this key in the sub-network output dict
  - DistillationDistanceLoss:  # Distance loss function for distillation
      weight: 1.0  # Weight
      mode: "l2"  # Distance calculation method; currently supports l1, l2, smooth_l1
      model_name_pairs:  # Sub-network name pairs used to calculate the distance loss
      - ["Student", "Teacher"]
      key: backbone_out  # Take the tensor corresponding to this key in the sub-network output dict


All of the above distillation loss functions inherit from the standard loss functions. They parse the output of the
distillation model, find the tensors used for the loss calculation, and then use the standard loss function classes
for the calculation.


With the above configuration, the loss function in the distillation training contains three parts.


- The CTC loss between the final output (head_out) of Student and Teacher and gt, with a weight of 1. Because both
sub-networks need to update their parameters, both need to calculate the loss with gt.
- The DML loss between the outputs (head_out) of Student and Teacher, with a weight of 1.
- The l2 loss between the backbone network outputs (backbone_out) of Student and Teacher, with a weight of 1.
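
The "parse the output dict, pick the tensors, apply a standard loss" idea can be sketched without Paddle. The code below is a hypothetical illustration, not the implementation in distillation_loss.py: `distillation_distance_loss` and the naming of the returned keys are assumptions, and NumPy arrays stand in for tensors.

```python
import numpy as np


def distance_l2(a, b):
    # Standard distance loss: mean squared difference.
    return float(np.mean((a - b) ** 2))


def distillation_distance_loss(predicts, model_name_pairs, key, weight=1.0):
    """Apply the distance loss to predicts[name][key] for every name pair."""
    loss_dict = {}
    for idx, (name1, name2) in enumerate(model_name_pairs):
        out1 = predicts[name1][key]
        out2 = predicts[name2][key]
        # Hypothetical key format: one entry per sub-network pair.
        loss_name = "l2_{}_{}_{}".format(name1, name2, idx)
        loss_dict[loss_name] = weight * distance_l2(out1, out2)
    return loss_dict


predicts = {
    "Teacher": {"backbone_out": np.ones((2, 4)), "head_out": np.zeros((2, 3))},
    "Student": {"backbone_out": np.zeros((2, 4)), "head_out": np.zeros((2, 3))},
}
losses = distillation_distance_loss(
    predicts, [["Student", "Teacher"]], key="backbone_out")
```

Adding another pair such as `["Student2", "Teacher"]` to the list produces one more entry in the returned dict, which mirrors how `model_name_pairs` is extended in the configuration.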


The CombinedLoss is implemented as follows. 


class CombinedLoss(nn.Layer):
    """
    CombinedLoss:
        a combination of loss functions
    """

    def __init__(self, loss_config_list=None):
        super().__init__()
        self.loss_func = []
        self.loss_weight = []
        assert isinstance(loss_config_list, list), (
            'operator config should be a list')
        for config in loss_config_list:
            assert isinstance(config,
                              dict) and len(config) == 1, "yaml format error"
            name = list(config)[0]
            param = config[name]
            assert "weight" in param, "weight must be in param, but param just contains {}".format(
                param.keys())
            self.loss_weight.append(param.pop("weight"))
            # Build the loss object from its class name and the remaining parameters
            self.loss_func.append(eval(name)(**param))

    def forward(self, input, batch, **kargs):
        loss_dict = {}
        loss_all = 0.
        for idx, loss_func in enumerate(self.loss_func):
            loss = loss_func(input, batch, **kargs)
            if isinstance(loss, paddle.Tensor):
                loss = {"loss_{}_{}".format(str(loss), idx): loss}

            # Scale every entry returned by this loss function by its weight
            weight = self.loss_weight[idx]
            loss = {key: loss[key] * weight for key in loss}

            if "loss" in loss:
                loss_all += loss["loss"]
            else:
                loss_all += paddle.add_n(list(loss.values()))
            loss_dict.update(loss)
        loss_dict["loss"] = loss_all
        return loss_dict
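
The weighting logic can be exercised in isolation with plain Python. The sketch below is a simplified analogy under assumptions: it replaces `eval(name)` with an explicit registry, uses toy loss functions that return constant values, and all names in it (`LOSS_REGISTRY`, `ToyCTCLoss`, `ToyCenterLoss`, `combined_loss`) are hypothetical.

```python
def make_constant_loss(value):
    # Toy loss factory: the returned function ignores its inputs.
    def loss_fn(predicts, batch):
        return {"loss": value}
    return loss_fn


# Explicit name-to-constructor mapping used instead of eval(name).
LOSS_REGISTRY = {
    "ToyCTCLoss": lambda: make_constant_loss(2.0),
    "ToyCenterLoss": lambda: make_constant_loss(10.0),
}


def combined_loss(loss_config_list, predicts, batch):
    loss_dict = {}
    loss_all = 0.0
    for idx, config in enumerate(loss_config_list):
        assert isinstance(config, dict) and len(config) == 1, "yaml format error"
        name = list(config)[0]
        param = dict(config[name])      # copy, so the config is not modified
        weight = param.pop("weight")    # every entry must provide a weight
        loss = LOSS_REGISTRY[name](**param)(predicts, batch)
        # Suffix keys with the index so entries from different losses do not collide.
        loss = {"{}_{}".format(key, idx): value * weight
                for key, value in loss.items()}
        loss_all += sum(loss.values())
        loss_dict.update(loss)
    loss_dict["loss"] = loss_all
    return loss_dict


result = combined_loss(
    [{"ToyCTCLoss": {"weight": 1.0}}, {"ToyCenterLoss": {"weight": 0.05}}],
    predicts=None, batch=None)
```

A registry makes the set of allowed loss names explicit, at the cost of one extra line whenever a new loss class is added.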


For the implementation of CombinedLoss, please refer to combined_loss.py. For the implementation of distillation loss
functions such as DistillationCTCLoss, please refer to distillation_loss.py.


When three models are distilled together, the Loss field also needs to be modified so that the losses between all three
sub-networks are taken into account.


Loss:
  name: CombinedLoss  # Loss function name; the loss class is built from this name
  loss_config_list:  # List of loss function configurations, required by CombinedLoss
  - DistillationCTCLoss:  # CTC loss function for distillation, inherited from the standard CTC loss
      weight: 1.0  # Weight of the loss function. Every loss configuration in loss_config_list must include this field
      model_name_list: ["Student", "Student2", "Teacher"]  # From the predictions of the distillation model, extract the outputs of these three sub-networks and calculate the CTC loss with gt
      key: head_out  # Take the tensor corresponding to this key in the sub-network output dict
  - DistillationDMLLoss:  # DML loss function for distillation, inherited from the standard DMLLoss
      weight: 1.0  # Weight
      act: "softmax"  # Activation function; it can be softmax, sigmoid or None, and the default is None
      model_name_pairs:  # Sub-network name pairs used to calculate the DML loss. To calculate the DML loss of other sub-networks, add more pairs to the list below
      - ["Student", "Teacher"]
      - ["Student2", "Teacher"]
      - ["Student", "Student2"]
      key: head_out  # Take the tensor corresponding to this key in the sub-network output dict
  - DistillationDistanceLoss:  # Distance loss function for distillation
      weight: 1.0  # Weight
      mode: "l2"  # Distance calculation method; currently supports l1, l2, smooth_l1
      model_name_pairs:  # Sub-network name pairs used to calculate the distance loss
      - ["Student", "Teacher"]
      - ["Student2", "Teacher"]
      - ["Student", "Student2"]
      key: backbone_out  # Take the tensor corresponding to this key in the sub-network output dict


# Download data
!wget -nc https://paddleocr.bj.bcebos.com/dataset/rec_data_lesson_demo.tar && tar -xf rec_data_lesson_demo.tar && rm rec_data_lesson_demo.tar

# Download the pre-trained model
!wget -nc https://paddleocr.bj.bcebos.com/PP-OCRv2/chinese/ch_PP-OCRv2_rec_train.tar && tar -xf ch_PP-OCRv2_rec_train.tar && rm ch_PP-OCRv2_rec_train.tar

!python tools/train.py -c configs/rec/ch_PP-OCRv2/ch_PP-OCRv2_rec_distillation.yml \
    -o Train.dataset.data_dir="./rec_data_lesson_demo/" \
    Train.dataset.label_file_list=["./rec_data_lesson_demo/train.txt"] \
    Train.loader.num_workers=0 \
    Train.loader.batch_size_per_card=64 \
    Eval.dataset.data_dir="./rec_data_lesson_demo/" \
    Eval.dataset.label_file_list=["./rec_data_lesson_demo/val.txt"] \
    Eval.loader.num_workers=0 \
    Optimizer.lr.values=[0.0001,0.00001] \
    Global.epoch_num=1 \
    Global.pretrained_model="./ch_PP-OCRv2_rec_train/best_accuracy"


Enhanced CTC Loss 


In Chinese character recognition, there are many similar-looking characters, which are easily misrecognized. Borrowing
from metric learning, Center loss is introduced to enlarge the distance between classes. The core formula is shown
below.


$$L = L_{ctc} + \lambda * L_{center}$$


$$L_{center} = \sum_{t=1}^{T} \| x_t - c_{y_t} \|_2^2$$


Here $x_t$ represents the feature at time step $t$, and $c_{y_t}$ represents the center corresponding to the label
$y_t$.
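
A small NumPy sketch can make the center term concrete. It is an illustration under assumptions, not the PaddleOCR CenterLoss class: `center_loss` and `enhanced_ctc_loss` are hypothetical names, the CTC value is passed in as a precomputed number, and `lam` is a placeholder for the weight of the center term.

```python
import numpy as np


def center_loss(features, labels, centers):
    """Sum over time steps of the squared distance to the center of each label.

    features: (T, D) per-time-step features, labels: (T,) class indices,
    centers: (C, D) one center per class.
    """
    diffs = features - centers[labels]
    return float(np.sum(diffs ** 2))


def enhanced_ctc_loss(ctc_loss_value, features, labels, centers, lam=0.05):
    # L = L_ctc + lambda * L_center
    return ctc_loss_value + lam * center_loss(features, labels, centers)


centers = np.array([[0.0, 0.0], [1.0, 1.0]])
features = np.array([[0.0, 1.0], [1.0, 1.0]])
labels = np.array([0, 1])
l_center = center_loss(features, labels, centers)
l_total = enhanced_ctc_loss(2.0, features, labels, centers)
```

Only the first time step is off its center in this example, so it alone contributes to the center term.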


In Enhanced CTC, the center initialization also has a great influence on the result. In PP-OCRv2, the specific steps of
center initialization are as follows.


1. Train a network based on the standard CTC loss;
2. Extract the set of correctly recognized images from the training set and denote it as G;
3. Input the pictures in G into the network one by one, and extract the correspondence between $x_t$ and $y_t$ from
the sequence features output by the head, where $y_t$ is calculated as follows:

$$y_t = argmax(W * x_t)$$

4. Aggregate the $x_t$ that share the same $y_t$, and average them as the initial center.
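
The aggregation in the last two steps can be sketched with NumPy. This is a minimal illustration under assumptions: `init_centers` is a hypothetical helper, `features` stacks the $x_t$ vectors of all correctly recognized images into one (N, D) array, and `W` is the (C, D) weight matrix of the final linear layer.

```python
import numpy as np


def init_centers(features, W, num_classes):
    # y_t = argmax(W * x_t) for every feature vector x_t.
    labels = np.argmax(features @ W.T, axis=1)
    centers = np.zeros((num_classes, features.shape[1]))
    for c in range(num_classes):
        mask = labels == c
        if mask.any():
            # Average all features assigned to class c.
            centers[c] = features[mask].mean(axis=0)
    return centers, labels


W = np.eye(2)
features = np.array([[2.0, 0.0], [4.0, 0.0], [0.0, 3.0]])
centers, labels = init_centers(features, W, num_classes=2)
```

Classes that never appear keep a zero center in this sketch. How such classes are handled in practice is not specified here.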
First, train a basic network with configs/rec/ch_PP-OCRv2/ch_PP-OCRv2_rec.yml 
For more training steps about center loss, please refer to: Enhanced CTC Loss Usage Document 
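Steps 3 and 4 above can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the PaddleOCR script: `init_centers` is a hypothetical name, the features of all time steps of all images in G are assumed to be stacked into one array, and classes that never appear keep an all-zero center.

```python
import numpy as np


def init_centers(features, W, num_classes):
    """features: (N, D) features x_t collected from the images in G.
    W: (C, D) weights of the head. Returns (centers, labels)."""
    # Step 3: y_t = argmax(W * x_t)
    labels = np.argmax(features @ W.T, axis=1)
    # Step 4: average the features that share the same y_t
    centers = np.zeros((num_classes, features.shape[1]), dtype=features.dtype)
    for c in range(num_classes):
        mask = labels == c
        if mask.any():
            centers[c] = features[mask].mean(axis=0)
    return centers, labels


# Toy example: three features, two classes, identity head weights.
features = np.array([[2.0, 0.0], [4.0, 0.0], [0.0, 3.0]])
W = np.eye(2)
centers, labels = init_centers(features, W, num_classes=2)
print(labels, centers)
```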


Finally, use configs/rec/ch_PP-OCRv2/ch_PP-OCRv2_rec_enhanced_ctc_loss.yml for training, and the command is as 
follows. 


python tools/train.py -c configs/rec/ch_PP-OCRv2/ch_PP-OCRv2_rec_enhanced_ctc_loss.yml 


The Loss field is the key part of this configuration. Compared with the standard CTCLoss, CenterLoss is added here, and the number of categories, the feature dimension, and the center file path are configured for it.


Loss:
  name: CombinedLoss
  loss_config_list:
  - CTCLoss:
      use_focal_loss: false
      weight: 1.0
  - CenterLoss:
      weight: 0.05
      num_classes: 6625
      feat_dim: 96
      center_file_path: "./train_center.pkl"


Summary of Text Recognition Optimization 


In optimizing the PP-OCRv2 text recognition model, the backbone network and the loss function were improved, and the knowledge distillation training method was introduced, increasing the recognition accuracy from 66.7% to 74.8%. The ablation experiment is shown below.


[Table image. Columns: Strategy, Model Size, Inference Time (CPU, ms). Rows: PP-OCR rec (MV3); PP-OCR rec (LCNet); PP-OCR rec (LCNet) + U-DML; PP-OCR rec (LCNet) + U-DML + Enhanced CTC loss.]

Table 3: Ablation study of LCNet, U-DML, and Enhanced CTC loss for text recognition. 


Figure 39: Ablation experiment of PP-OCRv2 recognition model 


Based on PP-OCRv2 text detection, the result of the recognition model is as follows.

Figure 40: The text detection result of PP-OCRv2


6.4 Summary


This chapter covers the optimization strategies of PP-OCR and PP-OCRv2.


PP-OCR uses 19 strategies, spanning the backbone network, learning rate strategy, data augmentation, and model pruning and quantization, to optimize the model, yielding a PP-OCR server system and a PP-OCR mobile system.


Compared with PP-OCR, PP-OCRv2 makes further improvements in three aspects (backbone network, data augmentation, and loss function) to tackle poor on-device inference efficiency, complex backgrounds, and misrecognition of similar characters. It also introduces the knowledge distillation training strategy to further improve the accuracy of the model. As a result, PP-OCRv2 is a text detection and recognition system that outperforms PP-OCR in both accuracy and speed.


6.5 Assignment 


For details, refer to Optimization Strategy Objective Questions and Optimization Strategy 
Practical Questions in the compulsory assignment column. 
CHAPTER
SEVEN 


INFERENCE AND DEPLOYMENT OF PP-OCRV2 


7.1 Overview of Inference and Deployment 


This chapter introduces high-performance inference, service deployment, and mobile-device deployment of the PP-OCRv2 system. After studying this chapter, you will know:


• How to choose an appropriate inference deployment method for different scenarios
• Inference methods of PP-OCRv2 models in various scenarios
• Inference deployment methods of Paddle Inference, Paddle Serving, and Paddle Lite


7.1.1 Introduction 


In the previous chapters, a model was trained. To predict with it, we need to define the model first, load the trained weights, send preprocessed data to the network for inference, and post-process the output to get the final result. This inference method is convenient for debugging, but low in efficiency.


There are therefore two offline inference solutions.


1. Inference based on the training engine uses the same engine as training. It is convenient for debugging and helps us quickly locate problems and verify results. Its programming language is mostly Python.


2. Inference based on the inference engine converts the trained model and removes the parts irrelevant to inference, which speeds up inference. Its programming language is Python or C++.


The differences between the two are shown below.


                     | Inference based on the training engine | Inference based on the inference engine
Features             | 1. Uses the same engine as training. 2. The network model needs to be defined in inference. 3. It is applicable to system integration. | 1. The model needs to be converted, with irrelevant parts removed. 2. There is no need to define a network model. 3. It is suitable for system integration.
Programming language | Python | Python or C++
Inference steps      | 1. Define the network structure on the Python side. 2. Prepare the input data. 3. Load the training model. 4. Perform inference. | 1. Prepare the input data. 2. Load the model structure and model parameters. 3. Perform inference.


For offline inference deployment, inference based on the inference engine is the more appropriate choice.

PaddlePaddle provides the following inference deployment solutions for different application scenarios.


• Paddle Inference. Scene: complex model algorithms, hardware with high performance. Hardware: X86 CPU; NVIDIA GPU; Loongson/Feiteng and other domestic CPUs; AI acceleration chips such as Kunlun/Ascend/Haiguang DCU.
• Paddle Lite. Feature: lightweight. Scene: miscellaneous hardware, limited resources, low power consumption. Hardware: Arm CPU; Arm/Qualcomm/Apple GPU; AI acceleration hardware such as Kunlun/Ascend/Kirin/Rockchip/Yingmai/Cambrian/Bitmain.
• Paddle Serving. Scene: large traffic, high concurrency, low latency, large throughput; resource elasticity regulation to cope with changes in service traffic; model combination, encryption, hot update, etc. Hardware: X86/Arm CPU; NVIDIA GPU; Kunlun/Shengteng, etc.
• Inference in the browser.
• Paddle2ONNX. Feature: open and compatible. Scene: third-party inference frameworks, supporting deployment on more AI chips. Hardware: Horizon X3; Corerain X3; Allwinner R329; other AI chips.

Figure 1: Different deployment options offered by PaddlePaddle 


PaddleOCR provides three inference deployment solutions for different application scenarios.


• Offline inference (Paddle Inference). It is mainly used in scenarios where the response-time requirement is not strict, especially those that process a large number of pictures, such as document digitization and advertising information extraction. Although requests cannot be answered in real time, there is no network latency, which makes computation more efficient and keeps the data secure.


• Service deployment (Paddle Serving). It is mainly used in scenarios requiring a timely inference response, such as commercial OCR APIs, real-time photo translation, and photo-based question search. Although this method responds in real time, it brings high network costs, inefficient use of the GPU, and data security risks.


• Mobile-device deployment (Paddle Lite). It deploys the model to devices such as mobile phones and robots for convenience and data security. Typical uses are ID card and bank card recognition in mobile apps, and instrument monitoring and recognition in industrial scenarios. This method is more sensitive to the size of the OCR model. Although there is no network latency and the data security risk is low, its inference efficiency is limited by the available computing power.


Based on PP-OCRv2, this chapter walks through the inference deployment process for text detection, text recognition, and the end-to-end system.


7.1.2 Environment Preparation 


To follow along with this chapter, first download the PaddleOCR code and install the related dependencies. The commands are as follows.


import os
os.chdir("/home/aistudio")

# Download code
!git clone https://gitee.com/paddlepaddle/PaddleOCR.git
os.chdir("/home/aistudio/PaddleOCR")

# Install and run the required whl package
!pip install -U pip
!pip install -r requirements.txt

# This library is needed in VQA tasks
!pip install paddlenlp==2.2.1

# Import some libraries
import cv2
import matplotlib.pyplot as plt
%matplotlib inline
import numpy as np
import os

7.2 Python Inference Based on Paddle Inference 


7.2.1 Introduction 


In a project, the inference performance of the model directly affects the cost, so a trained model is expected to run inference as fast as possible. If inference is performed directly on the training engine, efficiency is relatively low because the model contains training-related operators. The model also has to be defined in code, which makes it difficult to decouple from the training code. Paddle Inference was created for this case. It is the native inference library of PaddlePaddle, which runs on servers and in the cloud to provide high-performance inference. Since it is built on the training operators of PaddlePaddle, Paddle Inference supports all models trained by PaddlePaddle.


Because application scenarios vary from user to user, Paddle Inference has been adapted in depth for different platforms and scenarios. It offers high throughput and low latency, so that trained PaddlePaddle models are ready to use on the server and can be deployed quickly.


This chapter introduces the inference of PP-OCRv2 with Paddle Inference. To learn more about Paddle Inference, please refer to: Paddle Inference Introduction.


Model inference with Paddle Inference consists of several steps.

Figure 2: Inference process of Paddle Inference


The PP-OCRv2 system includes three models: text detection, direction classifier, and text recognition. The following sections introduce the inference process of these three models based on Paddle Inference.
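As a rough orientation before the model-specific sections, the sequence "create a config, create a predictor, copy the input in, run, copy the outputs out" can be sketched with the Paddle Inference Python API. This is a minimal sketch, not the PaddleOCR code: `model_files` and `run_once` are hypothetical helper names, the model directory is assumed to contain `inference.pdmodel` and `inference.pdiparams`, and the input is assumed to be a single float32 NCHW batch.

```python
import os


def model_files(model_dir):
    """Return the (structure, parameters) file paths of an exported inference model."""
    return (os.path.join(model_dir, "inference.pdmodel"),
            os.path.join(model_dir, "inference.pdiparams"))


def run_once(model_dir, batch, use_gpu=False):
    """Create a predictor, feed one NCHW float32 batch, and return all outputs."""
    # Imported lazily so that model_files() can be used without Paddle installed.
    from paddle import inference

    model_file, params_file = model_files(model_dir)
    config = inference.Config(model_file, params_file)
    if use_gpu:
        config.enable_use_gpu(500, 0)  # 500 MB initial memory pool on GPU 0
    else:
        config.disable_gpu()
    predictor = inference.create_predictor(config)

    # Copy the input from the CPU into the inference engine.
    input_name = predictor.get_input_names()[0]
    predictor.get_input_handle(input_name).copy_from_cpu(batch)

    predictor.run()

    # Copy every output back to the CPU as a NumPy array.
    outputs = []
    for name in predictor.get_output_names():
        outputs.append(predictor.get_output_handle(name).copy_to_cpu())
    return outputs


print(model_files("./inference/ch_PP-OCRv2_det_infer"))
```

Pre-processing and post-processing are deliberately left out here; they differ per model and are what the detection, classification, and recognition sections spend most of their code on.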


7.2.2 PP-OCRv2 Text Detection Model Inference 


In PaddleOCR, text detection inference takes an image or a directory of images through the parameter image_dir, and the parameter det_model_dir specifies the path of the detection inference model.


In the following, we run inference with the latest ultra-lightweight text detection model. For more models and usage, please refer to Text Detection Prediction Tutorial.


To learn more about the hyperparameters of other algorithms, please refer to PaddleOCR Inference related parameters introduction.


Data and Environment Preparation


Paddle and the corresponding dependencies were installed at the very beginning, so the environment is ready.


The test data is in the doc/imgs folder, and part of the data is shown below.


# Switch the directory
os.chdir("/home/aistudio/PaddleOCR")

# View data
!ls doc/imgs/

# Choose 2 images for visualization
img1 = cv2.imread("doc/imgs/00006737.jpg")
img2 = cv2.imread("doc/imgs/00056221.jpg")
plt.figure(figsize=(15, 6))
plt.subplot(1, 2, 1)
plt.imshow(img1[:, :, ::-1])
plt.subplot(1, 2, 2)
plt.imshow(img2[:, :, ::-1])
plt.show()


00006737.jpg 00056221.jpg 00111002.jpg
Download the inference model, unzip it, and place it in the inference folder.


# Download model
!mkdir inference
!cd inference && wget https://paddleocr.bj.bcebos.com/PP-OCRv2/chinese/ch_PP-OCRv2_det_infer.tar -O ch_PP-OCRv2_det_infer.tar && tar -xf ch_PP-OCRv2_det_infer.tar
!tree -h inference/ch_PP-OCRv2_det_infer
--2021-12-25 14:55:13--  https://paddleocr.bj.bcebos.com/PP-OCRv2/chinese/ch_PP-OCRv2_det_infer.tar
Resolving paddleocr.bj.bcebos.com (paddleocr.bj.bcebos.com)... 100.67.200.6
Connecting to paddleocr.bj.bcebos.com (paddleocr.bj.bcebos.com)|100.67.200.6|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3190272 (3.0M) [application/x-tar]
Saving to: ‘ch_PP-OCRv2_det_infer.tar’

ch_PP-OCRv2_det_inf 100%[===================>]   3.04M  --.-KB/s    in 0.07s

2021-12-25 14:55:13 (42.2 MB/s) - ‘ch_PP-OCRv2_det_infer.tar’ saved [3190272/3190272]

inference/ch_PP-OCRv2_det_infer
├── [2.2M]  inference.pdiparams
├── [ 23K]  inference.pdiparams.info
└── [845K]  inference.pdmodel

0 directories, 3 files


• If you want to export a model you have trained and deploy it with Paddle Inference, you can use the following commands to convert the pre-trained model into an inference model by converting the dynamic graph into a static graph.


# Reference Code
# https://github.com/PaddlePaddle/PaddleOCR/blob/release/2.4/tools/export_model.py

# Download the pre-trained model
!wget https://paddleocr.bj.bcebos.com/PP-OCRv2/chinese/ch_PP-OCRv2_det_distill_train.tar && tar -xf ch_PP-OCRv2_det_distill_train.tar && rm ch_PP-OCRv2_det_distill_train.tar

# Export inference model
!python tools/export_model.py -c configs/det/ch_PP-OCRv2/ch_PP-OCRv2_det_cml.yml \
    -o Global.pretrained_model="ch_PP-OCRv2_det_distill_train/best_accuracy" \
    Global.save_inference_dir="./my_model"

# The PP-OCRv2 detection model contains three sub-networks: teacher, student, and
# student2. Therefore, there are three sub-directories in the export, but only one
# student network is needed in inference.
!tree -h my_model


--2021-12-25 14:55:21--  https://paddleocr.bj.bcebos.com/PP-OCRv2/chinese/ch_PP-OCRv2_det_distill_train.tar
Resolving paddleocr.bj.bcebos.com (paddleocr.bj.bcebos.com)... 100.67.200.6
Connecting to paddleocr.bj.bcebos.com (paddleocr.bj.bcebos.com)|100.67.200.6|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 63830016 (61M) [application/x-tar]
Saving to: ‘ch_PP-OCRv2_det_distill_train.tar’

ch_PP-OCRv2_det_dis 100%[===================>]  60.87M  81.4MB/s    in 0.7s

2021-12-25 14:55:22 (81.4 MB/s) - ‘ch_PP-OCRv2_det_distill_train.tar’ saved [63830016/63830016]

W1225 14:55:24.746377  1078 device_context.cc:447] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 10.1, Runtime API Version: 10.1
W1225 14:55:24.749907  1078 device_context.cc:465] device: 0, cuDNN Version: 7.6.
root INFO: load pretrain successful from ch_PP-OCRv2_det_distill_train/best_accuracy
root INFO: inference model is saved to ./my_model/Teacher/inference
root INFO: inference model is saved to ./my_model/Student/inference
root INFO: inference model is saved to ./my_model/Student2/inference
my_model
├── [4.0K]  Student
│   ├── [2.2M]  inference.pdiparams
│   ├── [ 23K]  inference.pdiparams.info
│   └── [961K]  inference.pdmodel
├── [4.0K]  Student2
│   ├── [2.2M]  inference.pdiparams
│   ├── [ 23K]  inference.pdiparams.info
│   └── [962K]  inference.pdmodel
└── [4.0K]  Teacher
    ├── [ 47M]  inference.pdiparams
    ├── [ 12K]  inference.pdiparams.info
    └── [568K]  inference.pdmodel

3 directories, 9 files


A Preliminary Study of Text Detection Functions 


Let’s first take a look at the results after loading the inference model. 


# Reference Code
# https://github.com/PaddlePaddle/PaddleOCR/blob/release/2.4/tools/infer/predict_det.py

# Inference
!python tools/infer/predict_det.py --image_dir="./doc/imgs/00018069.jpg" --det_model_dir="./inference/ch_PP-OCRv2_det_infer" --use_gpu=False

# Read the image and display it, and show the result
plt.figure(figsize=(20, 8))
img_ori = cv2.imread("./doc/imgs/00018069.jpg")
img_out = cv2.imread("./inference_results/det_res_00018069.jpg")
plt.subplot(1, 2, 1)
plt.imshow(img_ori[:, :, ::-1])
plt.subplot(1, 2, 2)
plt.imshow(img_out[:, :, ::-1])
plt.show()


The following information will be printed. 


[2021/12/25 20:48:47] root INFO: 00018069.jpg [[[378, 249],


The following figure will be displayed.

[Figure: the input image 00018069.jpg (left) and the same image with the detected text boxes drawn on it (right)]


So how does it work? The following is a detailed explanation of how the PP-OCRv2 inference model is loaded and how the inference code runs.


First, set the parameters as shown below. To learn more about the parameters, please refer to: Introduction about parameters in the PaddleOCR inference process.


# Reference Code
# https://github.com/PaddlePaddle/PaddleOCR/blob/release/2.4/tools/infer/utility.py
import argparse
import os
import sys
import cv2
import numpy as np
import paddle
from PIL import Image, ImageDraw, ImageFont
import math
from paddle import inference
import time
from ppocr.utils.logging import get_logger


def str2bool(v):
    return v.lower() in ("true", "t", "1")


def init_args():
    parser = argparse.ArgumentParser()
    # params for prediction engine
    parser.add_argument("--use_gpu", type=str2bool, default=True)
    parser.add_argument("--ir_optim", type=str2bool, default=True)
    parser.add_argument("--use_tensorrt", type=str2bool, default=False)
    parser.add_argument("--min_subgraph_size", type=int, default=15)
    parser.add_argument("--precision", type=str, default="fp32")
    parser.add_argument("--gpu_mem", type=int, default=500)

    # params for text detector
    parser.add_argument("--image_dir", type=str)
    parser.add_argument("--det_algorithm", type=str, default='DB')
    parser.add_argument("--det_model_dir", type=str)
    parser.add_argument("--det_limit_side_len", type=float, default=960)
    parser.add_argument("--det_limit_type", type=str, default='max')

    # DB params
    parser.add_argument("--det_db_thresh", type=float, default=0.3)
    parser.add_argument("--det_db_box_thresh", type=float, default=0.6)
    parser.add_argument("--det_db_unclip_ratio", type=float, default=1.5)
    parser.add_argument("--max_batch_size", type=int, default=10)
    parser.add_argument("--use_dilation", type=str2bool, default=False)
    parser.add_argument("--det_db_score_mode", type=str, default="fast")

    # EAST params
    parser.add_argument("--det_east_score_thresh", type=float, default=0.8)
    parser.add_argument("--det_east_cover_thresh", type=float, default=0.1)
    parser.add_argument("--det_east_nms_thresh", type=float, default=0.2)

    # SAST params
    parser.add_argument("--det_sast_score_thresh", type=float, default=0.5)
    parser.add_argument("--det_sast_nms_thresh", type=float, default=0.2)
    parser.add_argument("--det_sast_polygon", type=str2bool, default=False)

    # PSE params
    parser.add_argument("--det_pse_thresh", type=float, default=0)
    parser.add_argument("--det_pse_box_thresh", type=float, default=0.85)
    parser.add_argument("--det_pse_min_area", type=float, default=16)
    parser.add_argument("--det_pse_box_type", type=str, default='box')
    parser.add_argument("--det_pse_scale", type=int, default=1)

    # params for text recognizer
    parser.add_argument("--rec_algorithm", type=str, default='CRNN')
    parser.add_argument("--rec_model_dir", type=str)
    parser.add_argument("--rec_image_shape", type=str, default="3, 32, 320")
    parser.add_argument("--rec_batch_num", type=int, default=6)
    parser.add_argument("--max_text_length", type=int, default=25)
    parser.add_argument(
        "--rec_char_dict_path",
        type=str,
        default="./ppocr/utils/ppocr_keys_v1.txt")
    parser.add_argument("--use_space_char", type=str2bool, default=True)
    parser.add_argument(
        "--vis_font_path", type=str, default="./doc/fonts/simfang.ttf")
    parser.add_argument("--drop_score", type=float, default=0.5)

    # params for e2e
    parser.add_argument("--e2e_algorithm", type=str, default='PGNet')
    parser.add_argument("--e2e_model_dir", type=str)
    parser.add_argument("--e2e_limit_side_len", type=float, default=768)
    parser.add_argument("--e2e_limit_type", type=str, default='max')

    # PGNet params
    parser.add_argument("--e2e_pgnet_score_thresh", type=float, default=0.5)
    parser.add_argument(
        "--e2e_char_dict_path", type=str, default="./ppocr/utils/ic15_dict.txt")
    parser.add_argument("--e2e_pgnet_valid_set", type=str, default='totaltext')
    parser.add_argument("--e2e_pgnet_mode", type=str, default='fast')

    # params for text classifier
    parser.add_argument("--use_angle_cls", type=str2bool, default=False)
    parser.add_argument("--cls_model_dir", type=str)
    parser.add_argument("--cls_image_shape", type=str, default="3, 48, 192")
    parser.add_argument("--label_list", type=list, default=['0', '180'])
    parser.add_argument("--cls_batch_num", type=int, default=6)
    parser.add_argument("--cls_thresh", type=float, default=0.9)

    parser.add_argument("--enable_mkldnn", type=str2bool, default=False)
    parser.add_argument("--cpu_threads", type=int, default=10)
    parser.add_argument("--use_pdserving", type=str2bool, default=False)
    parser.add_argument("--warmup", type=str2bool, default=False)

    #
    parser.add_argument(
        "--draw_img_save_dir", type=str, default="./inference_results")
    parser.add_argument("--save_crop_res", type=str2bool, default=False)
    parser.add_argument("--crop_res_save_dir", type=str, default="./output")

    # multi-process
    parser.add_argument("--use_mp", type=str2bool, default=False)
    parser.add_argument("--total_process_num", type=int, default=1)
    parser.add_argument("--process_id", type=int, default=0)

    parser.add_argument("--benchmark", type=str2bool, default=False)
    parser.add_argument("--save_log_path", type=str, default="./log_output/")

    parser.add_argument("--show_log", type=str2bool, default=True)
    parser.add_argument("--use_onnx", type=str2bool, default=False)

    # It should be noted here that this is added because if you parse directly in the
    # notebook, sys.argv will add the following content at the back, causing the parsing
    # to fail:
    # '-f', '/home/aistudio/.local/share/jupyter/runtime/kernel-e1221262-c656-4129-...'
    parser.add_argument("-f", type=str, default=None)
    return parser


def parse_args():
    parser = init_args()
    return parser.parse_args()
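Two details of this parser are easy to miss: boolean flags go through `str2bool` because `bool("False")` is `True` in Python, and the extra `-f` option only exists to absorb the argument that Jupyter appends to `sys.argv`. The sketch below isolates both with a reduced parser; `build_parser` and the kernel file path are hypothetical, and only two of the real options are reproduced.

```python
import argparse


def str2bool(v):
    return v.lower() in ("true", "t", "1")


def build_parser():
    parser = argparse.ArgumentParser()
    parser.add_argument("--use_gpu", type=str2bool, default=True)
    parser.add_argument("--det_db_thresh", type=float, default=0.3)
    # Jupyter passes "-f <kernel connection file>"; accept it and ignore it.
    parser.add_argument("-f", type=str, default=None)
    return parser


parser = build_parser()

# Simulate the arguments seen inside a notebook (the path is made up).
args = parser.parse_args(["--use_gpu=False", "-f", "/tmp/kernel-demo.json"])
print(args.use_gpu, args.det_db_thresh)

# Without "--use_gpu" the default applies; attributes can also be overridden
# in code, which is how the notebook cells set args.det_model_dir later on.
defaults = parser.parse_args([])
defaults.det_db_thresh = 0.5
```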


Let’s look at the code for text detection. 


# Reference Code
# https://github.com/PaddlePaddle/PaddleOCR/blob/release/2.4/tools/infer/predict_det.py
import os
import sys
import cv2
import numpy as np
import time
import tools.infer.utility as utility
from ppocr.utils.logging import get_logger
from ppocr.utils.utility import get_image_file_list, check_and_read_gif
from ppocr.data import create_operators, transform
from ppocr.postprocess import build_post_process
import json

logger = get_logger()


# Text detection
class TextDetector(object):
    def __init__(self, args):
        self.args = args
        self.det_algorithm = args.det_algorithm
        pre_process_list = [{
            'DetResizeForTest': {
                'limit_side_len': args.det_limit_side_len,
                'limit_type': args.det_limit_type,
            }
        }, {
            'NormalizeImage': {
                'std': [0.229, 0.224, 0.225],
                'mean': [0.485, 0.456, 0.406],
                'scale': '1./255.',
                'order': 'hwc'
            }
        }, {
            'ToCHWImage': None
        }, {
            'KeepKeys': {
                'keep_keys': ['image', 'shape']
            }
        }]
        postprocess_params = {}
        if self.det_algorithm == "DB":
            postprocess_params['name'] = 'DBPostProcess'
            postprocess_params["thresh"] = args.det_db_thresh
            postprocess_params["box_thresh"] = args.det_db_box_thresh
            postprocess_params["max_candidates"] = 1000
            postprocess_params["unclip_ratio"] = args.det_db_unclip_ratio
            postprocess_params["use_dilation"] = args.use_dilation
            postprocess_params["score_mode"] = args.det_db_score_mode
        else:
            logger.info("unknown det_algorithm: {}".format(self.det_algorithm))
            sys.exit(0)
        # Initialize the inference engine
        self.predictor, self.input_tensor, self.output_tensors, self.config = utility.create_predictor(
            args, 'det', logger)
        # Build the preprocessing operator
        self.preprocess_op = create_operators(pre_process_list)
        # Build the postprocessing operator
        self.postprocess_op = build_post_process(postprocess_params)

    def order_points_clockwise(self, pts):
        """
        Refer to: https://github.com/jrosebr1/imutils/blob/master/imutils/perspective.py
        Sort out the detected points clockwise
        """
        xSorted = pts[np.argsort(pts[:, 0]), :]

        leftMost = xSorted[:2, :]
        rightMost = xSorted[2:, :]

        leftMost = leftMost[np.argsort(leftMost[:, 1]), :]
        (tl, bl) = leftMost

        rightMost = rightMost[np.argsort(rightMost[:, 1]), :]
        (tr, br) = rightMost

        rect = np.array([tl, tr, br, bl], dtype="float32")
        return rect

    def clip_det_res(self, points, img_height, img_width):
        # Limit the detection results according to the width and height to prevent
        # them from exceeding the image boundaries
        for pno in range(points.shape[0]):
            points[pno, 0] = int(min(max(points[pno, 0], 0), img_width - 1))
            points[pno, 1] = int(min(max(points[pno, 1], 0), img_height - 1))
        return points

    def filter_tag_det_res(self, dt_boxes, image_shape):
        # Remove detection results smaller than a certain size
        img_height, img_width = image_shape[0:2]
        dt_boxes_new = []
        for box in dt_boxes:
            box = self.order_points_clockwise(box)
            box = self.clip_det_res(box, img_height, img_width)
            rect_width = int(np.linalg.norm(box[0] - box[1]))
            rect_height = int(np.linalg.norm(box[0] - box[3]))
            if rect_width <= 3 or rect_height <= 3:
                continue
            dt_boxes_new.append(box)
        dt_boxes = np.array(dt_boxes_new)
        return dt_boxes

    def filter_tag_det_res_only_clip(self, dt_boxes, image_shape):
        # Limit the boundary of the detection result
        img_height, img_width = image_shape[0:2]
        dt_boxes_new = []
        for box in dt_boxes:
            box = self.clip_det_res(box, img_height, img_width)
            dt_boxes_new.append(box)
        dt_boxes = np.array(dt_boxes_new)
        return dt_boxes

    def __call__(self, img):
        ori_im = img.copy()
        data = {'image': img}
        st = time.time()

        # Data preprocessing
        data = transform(data, self.preprocess_op)
        img, shape_list = data
        if img is None:
            return None, 0
        # Extended bs dimension: CHW -> NCHW
        img = np.expand_dims(img, axis=0)
        shape_list = np.expand_dims(shape_list, axis=0)
        img = img.copy()

        # Copy the data to the inference engine
        self.input_tensor.copy_from_cpu(img)
        # Automatic inference
        self.predictor.run()
        outputs = []
        # Copy the returned result from the inference engine to the CPU
        for output_tensor in self.output_tensors:
            output = output_tensor.copy_to_cpu()
            outputs.append(output)

        preds = {}
        if self.det_algorithm in ['DB', 'PSE']:
            preds['maps'] = outputs[0]
        else:
            raise NotImplementedError

        # Post-processing
        post_result = self.postprocess_op(preds, shape_list)
        dt_boxes = post_result[0]['points']
        dt_boxes = self.filter_tag_det_res(dt_boxes, ori_im.shape)
        et = time.time()
        return dt_boxes, et - st


# Set parameters
args = parse_args()
args.det_model_dir = "./inference/ch_PP-OCRv2_det_infer"
args.image_dir = "./doc/imgs/00018069.jpg"

# Get the picture list
image_file_list = get_image_file_list(args.image_dir)
# Create a text detector
text_detector = TextDetector(args)

count = 0
total_time = 0
draw_img_save = "./inference_results"

if not os.path.exists(draw_img_save):
    os.makedirs(draw_img_save)
save_results = []
for image_file in image_file_list:
    img = cv2.imread(image_file)
    if img is None:
        logger.info("error in loading image: {}".format(image_file))
        continue
    st = time.time()
    dt_boxes, _ = text_detector(img)
    elapse = time.time() - st
    if count > 0:
        total_time += elapse
    count += 1
    save_pred = os.path.basename(image_file) + "\t" + str(
        json.dumps(np.array(dt_boxes).astype(np.int32).tolist())) + "\n"
    save_results.append(save_pred)
    logger.info(save_pred)
    logger.info("The predict time of {}: {}".format(image_file, elapse))
    src_im = utility.draw_text_det_res(dt_boxes, image_file)
    img_name_pure = os.path.split(image_file)[-1]
    img_path = os.path.join(draw_img_save,
                            "det_res_{}".format(img_name_pure))
    cv2.imwrite(img_path, src_im)
    logger.info("The visualized image saved in {}".format(img_path))
    # Stop after the first image
    break

# The file is closed automatically when the with block exits
with open(os.path.join(draw_img_save, "det_results.txt"), 'w') as f:
    f.writelines(save_results)

plt.figure(figsize=(10, 10))
plt.imshow(src_im[:, :, ::-1])
plt.show()


The following information will be printed. 


[2021/12/25 20:48:47] root INFO: 00018069.jpg [[[378, 249],


The following figure will be displayed.

[Figure: 00018069.jpg with the detected text boxes drawn on it]


This is the whole text detection process. 
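The box clean-up at the end of detection (order the four corners clockwise, clip them to the image, drop boxes that are too small) is easier to follow on a concrete quadrilateral. The sketch below re-implements those three steps as standalone NumPy functions; the function names `clip_box` and `keep_box` and the sample coordinates are illustrative assumptions, not part of PaddleOCR.

```python
import numpy as np


def order_points_clockwise(pts):
    """Return the 4 corners as top-left, top-right, bottom-right, bottom-left."""
    x_sorted = pts[np.argsort(pts[:, 0]), :]
    left, right = x_sorted[:2, :], x_sorted[2:, :]
    tl, bl = left[np.argsort(left[:, 1]), :]
    tr, br = right[np.argsort(right[:, 1]), :]
    return np.array([tl, tr, br, bl], dtype="float32")


def clip_box(box, img_height, img_width):
    """Clamp every corner into [0, width - 1] x [0, height - 1]."""
    box = box.copy()
    box[:, 0] = np.clip(box[:, 0], 0, img_width - 1)
    box[:, 1] = np.clip(box[:, 1], 0, img_height - 1)
    return box


def keep_box(box, min_side=3):
    """Keep the box only if both sides are longer than min_side pixels."""
    width = int(np.linalg.norm(box[0] - box[1]))
    height = int(np.linalg.norm(box[0] - box[3]))
    return width > min_side and height > min_side


# An unordered box that sticks out of a 100 x 32 (width x height) image.
raw = np.array([[120, 40], [-5, 10], [120, 10], [-5, 40]], dtype="float32")
box = clip_box(order_points_clockwise(raw), img_height=32, img_width=100)
tiny = np.array([[0, 0], [2, 0], [2, 2], [0, 2]], dtype="float32")
print(box.tolist(), keep_box(box), keep_box(tiny))
```

Ordering comes first because the width and height test relies on `box[0]`, `box[1]`, and `box[3]` being the top-left, top-right, and bottom-left corners.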
7.2.3 Inference of PP-OCRv2 Direction Classifier Model


Similarly, the following commands give a quick first look at the direction classifier model.


# Download model
!cd inference && wget https://paddleocr.bj.bcebos.com/dygraph_v2.0/ch/ch_ppocr_mobile_v2.0_cls_infer.tar -O ch_ppocr_mobile_v2.0_cls_infer.tar && tar -xf ch_ppocr_mobile_v2.0_cls_infer.tar

# Inference
!python tools/infer/predict_cls.py \
    --image_dir="./doc/imgs_words/ch/word_1.jpg" \
    --cls_model_dir="./inference/ch_ppocr_mobile_v2.0_cls_infer" \
    --use_gpu=False

# Draw the image
img = cv2.imread("./doc/imgs_words/ch/word_1.jpg")
plt.imshow(img[:, :, ::-1])
plt.show()


--2021-12-25 15:40:13--  https://paddleocr.bj.bcebos.com/dygraph_v2.0/ch/ch_ppocr_mobile_v2.0_cls_infer.tar
Resolving paddleocr.bj.bcebos.com (paddleocr.bj.bcebos.com)... 100.67.200.6
Connecting to paddleocr.bj.bcebos.com (paddleocr.bj.bcebos.com)|100.67.200.6|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1454080 (1.4M) [application/x-tar]
Saving to: ‘ch_ppocr_mobile_v2.0_cls_infer.tar’

ch_ppocr_mobile_v2. 100%[===================>]   1.39M  --.-KB/s    in 0.04s

2021-12-25 15:40:13 (338.6 MB/s) - ‘ch_ppocr_mobile_v2.0_cls_infer.tar’ saved [1454080/1454080]

[2021/12/25 15:40:15] root INFO: Predicts of ./doc/imgs_words/ch/word_1.jpg:['0', 0.9998784]

[Figure: the input image word_1.jpg]


The picture is horizontal and its text is not reversed, and the prediction result is correct. 
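The classifier output is a label from `label_list` (`'0'` or `'180'`) and a confidence score. A natural way to use it, sketched below as an assumption about the downstream step rather than as the PaddleOCR code, is to rotate the crop by 180 degrees only when the label is `'180'` and the score exceeds `cls_thresh` (0.9 by default in the parameters above). The function name `apply_direction` is hypothetical.

```python
import numpy as np


def apply_direction(img, label, score, cls_thresh=0.9):
    """Rotate img by 180 degrees when the classifier confidently predicts '180'."""
    if label == "180" and score > cls_thresh:
        return np.rot90(img, 2)  # two 90-degree turns in the height-width plane
    return img


# A tiny 2 x 3 single-channel "image" whose pixel values are 0..5.
img = np.arange(6).reshape(2, 3, 1)

same = apply_direction(img, "0", 0.9998784)   # the prediction printed above
flipped = apply_direction(img, "180", 0.95)   # confident reversed text
unsure = apply_direction(img, "180", 0.6)     # below the threshold: left as is
print(flipped[:, :, 0])
```

Keeping low-confidence crops unchanged trades a few missed rotations for not corrupting crops that were already upright.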


The implementation code of the direction classifier is as follows. 


# Reference Code 
# https://github.com/PaddlePaddle/PaddleOCR/blob/release/2.4/tools/infer/predict_cls. 


Sy, 
import copy 


# Implementation of the direction classifier
class TextClassifier(object):
    def __init__(self, args):
        self.cls_image_shape = [int(v) for v in args.cls_image_shape.split(",")]
        self.cls_batch_num = args.cls_batch_num
        self.cls_thresh = args.cls_thresh
        postprocess_params = {
            'name': 'ClsPostProcess',
            "label_list": args.label_list,
        }
        # Post-processing operator
        self.postprocess_op = build_post_process(postprocess_params)
        # Initialize the inference engine
        self.predictor, self.input_tensor, self.output_tensors, _ = \
            utility.create_predictor(args, 'cls', logger)

    # Resize and normalize the image
    def resize_norm_img(self, img):
        imgC, imgH, imgW = self.cls_image_shape
        h = img.shape[0]
        w = img.shape[1]
        ratio = w / float(h)
        if math.ceil(imgH * ratio) > imgW:
            resized_w = imgW
        else:
            resized_w = int(math.ceil(imgH * ratio))
        resized_image = cv2.resize(img, (resized_w, imgH))
        resized_image = resized_image.astype('float32')
        if self.cls_image_shape[0] == 1:
            resized_image = resized_image / 255
            resized_image = resized_image[np.newaxis, :]
        else:
            resized_image = resized_image.transpose((2, 0, 1)) / 255
        resized_image -= 0.5
        resized_image /= 0.5
        padding_im = np.zeros((imgC, imgH, imgW), dtype=np.float32)
        padding_im[:, :, 0:resized_w] = resized_image
        return padding_im

    def __call__(self, img_list):
        img_list = copy.deepcopy(img_list)
        img_num = len(img_list)
        # Record the aspect ratio
        width_list = []
        for img in img_list:
            width_list.append(img.shape[1] / float(img.shape[0]))
        # Sort and speed up the subsequent pre-processing
        indices = np.argsort(np.array(width_list))
        cls_res = [['', 0.0]] * img_num
        batch_num = self.cls_batch_num
        elapse = 0
        for beg_img_no in range(0, img_num, batch_num):
            end_img_no = min(img_num, beg_img_no + batch_num)
            norm_img_batch = []
            max_wh_ratio = 0
            starttime = time.time()
            # Preprocess data, and group batches
            for ino in range(beg_img_no, end_img_no):
                h, w = img_list[indices[ino]].shape[0:2]
                wh_ratio = w * 1.0 / h
                max_wh_ratio = max(max_wh_ratio, wh_ratio)
            for ino in range(beg_img_no, end_img_no):
                norm_img = self.resize_norm_img(img_list[indices[ino]])
                norm_img = norm_img[np.newaxis, :]
                norm_img_batch.append(norm_img)
            norm_img_batch = np.concatenate(norm_img_batch)
            norm_img_batch = norm_img_batch.copy()
            # Copy the data to the inference engine
            self.input_tensor.copy_from_cpu(norm_img_batch)
            # Automatically perform the inference
            self.predictor.run()
            # Copy the data back to the CPU
            prob_out = self.output_tensors[0].copy_to_cpu()
            # Post-process
            cls_result = self.postprocess_op(prob_out)
            elapse += time.time() - starttime
            for rno in range(len(cls_result)):
                label, score = cls_result[rno]
                cls_res[indices[beg_img_no + rno]] = [label, score]
                if '180' in label and score > self.cls_thresh:
                    # Rotate code 1 corresponds to cv2.ROTATE_180
                    img_list[indices[beg_img_no + rno]] = cv2.rotate(
                        img_list[indices[beg_img_no + rno]], 1)
        return img_list, cls_res, elapse


args = parse_args()
args.cls_model_dir = "./inference/ch_ppocr_mobile_v2.0_cls_infer"
args.image_dir = "./doc/imgs_words/ch/word_4.jpg"

image_file_list = get_image_file_list(args.image_dir)
text_classifier = TextClassifier(args)
valid_image_file_list = []
img_list = []
for image_file in image_file_list:
    img = cv2.imread(image_file)
    if img is None:
        logger.info("error in loading image: {}".format(image_file))
        continue
    # Rotate the image 180 degrees before inference
    # img = cv2.rotate(img, cv2.ROTATE_180)
    valid_image_file_list.append(image_file)
    img_list.append(img)
img_list, cls_res, predict_time = text_classifier(img_list)
for ino in range(len(img_list)):
    logger.info("Predicts of {}:{}".format(valid_image_file_list[ino],
                                           cls_res[ino]))

plt.imshow(img[:, :, ::-1])
plt.show()


[2021/12/25 15:43:28] root INFO: Predicts of ./doc/imgs_words/ch/word_4.jpg:['0', 0.9999982]

You can also rotate the image by 180 degrees before inference (uncomment the cv2.rotate line in the code above) to see how the classification result of the direction classifier changes.


This is the inference process of the direction classifier. 
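
The preprocessing arithmetic in resize_norm_img can be traced by hand with a small, dependency-free sketch. It assumes a target shape of 3 x 48 x 192 for cls_image_shape (an assumption for illustration; the real value is read from args.cls_image_shape), and the helper names resized_width and normalize_pixel are hypothetical.

```python
import math


def resized_width(h, w, img_h=48, img_w=192):
    """Width after scaling to height img_h with the aspect ratio kept, capped at img_w."""
    ratio = w / float(h)
    return min(img_w, int(math.ceil(img_h * ratio)))


def normalize_pixel(value):
    """Map a pixel value from [0, 255] to [-1, 1], i.e. (x / 255 - 0.5) / 0.5."""
    return (value / 255.0 - 0.5) / 0.5


# A 32 x 100 crop is scaled to 48 x 150; the remaining 42 columns stay zero-padded.
print(resized_width(32, 100))   # 150
# A very wide crop is squeezed to the maximum width.
print(resized_width(32, 400))   # 192
print(normalize_pixel(0), normalize_pixel(255))   # -1.0 1.0
```

Capping the width keeps every image in a batch at the same tensor shape, which is what allows the batch to be concatenated before it is copied to the inference engine.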


7.2.4 Inference of PP-OCRv2 Text Recognition Model 


Use the following commands to quickly try out the text recognition model.


# Download the model
!cd inference && wget https://paddleocr.bj.bcebos.com/PP-OCRv2/chinese/ch_PP-OCRv2_rec_infer.tar -O ch_PP-OCRv2_rec_infer.tar && tar -xf ch_PP-OCRv2_rec_infer.tar
# Inference
!python tools/infer/predict_rec.py \
    --image_dir="./doc/imgs_words/ch/word_4.jpg" \
    --rec_model_dir="./inference/ch_PP-OCRv2_rec_infer" \
    --use_gpu=False

# Read the image and display it
img = cv2.imread("./doc/imgs_words/ch/word_4.jpg")
plt.imshow(img[:, :, ::-1])
plt.show()
--2021-12-25 15:43:40--  https://paddleocr.bj.bcebos.com/PP-OCRv2/chinese/ch_PP-OCRv2_rec_infer.tar
Resolving paddleocr.bj.bcebos.com (paddleocr.bj.bcebos.com)... 100.67.200.6
Connecting to paddleocr.bj.bcebos.com (paddleocr.bj.bcebos.com)|100.67.200.6|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 8875520 (8.5M) [application/x-tar]
Saving to: ‘ch_PP-OCRv2_rec_infer.tar’

ch_PP-OCRv2_rec_inf 100%[===================>]   8.46M  --.-KB/s    in 0.1s

2021-12-25 15:43:40 (64.5 MB/s) - ‘ch_PP-OCRv2_rec_infer.tar’ saved [8875520/8875520]


[2021/12/25 15:43:42] root INFO: Predicts of ./doc/imgs_words/ch/word_4.jpg: ('ARRA 
te, Ose 9401951851)


The code for text recognition is shown below.


# Reference Code
# https://github.com/PaddlePaddle/PaddleOCR/blob/release/2.4/tools/infer/predict_rec.py
class TextRecognizer(object):
    def __init__(self, args):
        self.rec_image_shape = [int(v) for v in args.rec_image_shape.split(",")]
        self.rec_batch_num = args.rec_batch_num
        self.rec_algorithm = args.rec_algorithm
        postprocess_params = {
            'name': 'CTCLabelDecode',
            "character_dict_path": args.rec_char_dict_path,
            "use_space_char": args.use_space_char
        }
        # Initialize the inference engine
        self.predictor, self.input_tensor, self.output_tensors, self.config = \
            utility.create_predictor(args, 'rec', logger)
        # Build the post-processing operator
        self.postprocess_op = build_post_process(postprocess_params)

    # Core pre-processing logic
    def resize_norm_img(self, img, max_wh_ratio):
        imgC, imgH, imgW = self.rec_image_shape
        assert imgC == img.shape[2]
        imgW = int((32 * max_wh_ratio))
        h, w = img.shape[:2]
        ratio = w / float(h)
        if math.ceil(imgH * ratio) > imgW:
            resized_w = imgW
        else:
            resized_w = int(math.ceil(imgH * ratio))
        resized_image = cv2.resize(img, (resized_w, imgH))
        resized_image = resized_image.astype('float32')
        # [0, 255] -> [0, 1]
        resized_image = resized_image.transpose((2, 0, 1)) / 255
        # [0, 1] -> [-0.5, 0.5]
        resized_image -= 0.5
        # [-0.5, 0.5] -> [-1, 1]
        resized_image /= 0.5
        padding_im = np.zeros((imgC, imgH, imgW), dtype=np.float32)
        padding_im[:, :, 0:resized_w] = resized_image
        return padding_im

    # Process the image list
    def __call__(self, img_list):
        img_num = len(img_list)
        # Record aspect ratio
        width_list = []
        for img in img_list:
            width_list.append(img.shape[1] / float(img.shape[0]))
        # Sort and speed up the process
        indices = np.argsort(np.array(width_list))
        rec_res = [['', 0.0]] * img_num
        batch_num = self.rec_batch_num
        st = time.time()
        for beg_img_no in range(0, img_num, batch_num):
            end_img_no = min(img_num, beg_img_no + batch_num)
            norm_img_batch = []
            max_wh_ratio = 0
            for ino in range(beg_img_no, end_img_no):
                h, w = img_list[indices[ino]].shape[0:2]
                wh_ratio = w * 1.0 / h
                max_wh_ratio = max(max_wh_ratio, wh_ratio)
            # Call preprocessing method and group batch
            for ino in range(beg_img_no, end_img_no):
                norm_img = self.resize_norm_img(img_list[indices[ino]],
                                                max_wh_ratio)
                norm_img = norm_img[np.newaxis, :]
                norm_img_batch.append(norm_img)
            norm_img_batch = np.concatenate(norm_img_batch)
            norm_img_batch = norm_img_batch.copy()
            # Copy the data to the prediction engine
            self.input_tensor.copy_from_cpu(norm_img_batch)
            # Automated inference process
            self.predictor.run()
            outputs = []
            # Copy data to CPU
            for output_tensor in self.output_tensors:
                output = output_tensor.copy_to_cpu()
                outputs.append(output)
            if len(outputs) != 1:
                preds = outputs
            else:
                preds = outputs[0]
            # Post-process
            rec_result = self.postprocess_op(preds)
            for rno in range(len(rec_result)):
                rec_res[indices[beg_img_no + rno]] = rec_result[rno]
        return rec_res, time.time() - st


# Define parameters
args = parse_args()
args.rec_model_dir = "./inference/ch_PP-OCRv2_rec_infer"
args.image_dir = "./doc/imgs_words/ch/word_4.jpg"
img_list = []

image_file_list = get_image_file_list(args.image_dir)
text_recognizer = TextRecognizer(args)
valid_image_file_list = []
for image_file in image_file_list:
    img = cv2.imread(image_file)
    if img is None:
        logger.info("error in loading image: {}".format(image_file))
        continue
    valid_image_file_list.append(image_file)
    img_list.append(img)

rec_res, _ = text_recognizer(img_list)
for ino in range(len(img_list)):
    logger.info("Predicts of {}:{}".format(valid_image_file_list[ino],
                                           rec_res[ino]))


[2021/12/25 15:51:51] root INFO: Predicts of ./doc/imgs_words/ch/word_4.jpg: ('RRAL 
me, Oe I4 095615) 
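
The batching logic in TextRecognizer.__call__ (sort by aspect ratio, pad each batch to its widest image, write results back through the saved indices) can be tried without a model. The following is a minimal sketch on plain (height, width) tuples; the function name batch_by_aspect_ratio and the sample shapes are hypothetical.

```python
def batch_by_aspect_ratio(shapes, batch_num, img_h=32):
    """Group (h, w) shapes into batches of similar aspect ratio.

    Returns a list of (indices, padded_width) pairs, where indices refer to
    positions in the original list and padded_width = int(img_h * max ratio).
    """
    ratios = [w / float(h) for h, w in shapes]
    order = sorted(range(len(shapes)), key=lambda i: ratios[i])
    batches = []
    for beg in range(0, len(order), batch_num):
        idx = order[beg:beg + batch_num]
        max_wh_ratio = max(ratios[i] for i in idx)
        batches.append((idx, int(img_h * max_wh_ratio)))
    return batches


shapes = [(32, 320), (32, 64), (48, 480), (32, 96)]  # hypothetical crop sizes (h, w)
batches = batch_by_aspect_ratio(shapes, batch_num=2)
print(batches)  # [([1, 3], 96), ([0, 2], 320)]

# Results are stored through the saved indices, so the output order
# matches the input order even though processing was done in sorted order.
results = [None] * len(shapes)
for idx, width in batches:
    for i in idx:
        results[i] = "crop %d padded to width %d" % (i, width)
print(results)
```

Sorting first keeps narrow crops out of batches that contain very wide ones, so less of each batch tensor is padding.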


7.2.5 End-to-end Inference of PP-OCRv2 System 


The previous parts introduced the separate inference processes of the detection model, the direction classifier, and
the recognition model in the PP-OCRv2 system. For the convenience of end users, these three modules are connected
in series to form the PP-OCRv2 system, and an inference script is provided for it.


When running system inference with PP-OCRv2, specify the path of a single image or an image directory with the
parameter image_dir, and specify the inference model paths of the detector, the direction classifier, and the
recognizer with the parameters det_model_dir, cls_model_dir, and rec_model_dir respectively. The parameter
use_angle_cls controls whether the direction classifier is enabled, use_mp indicates whether to use multiple
processes, and total_process_num sets the number of processes.
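
When multiple processes are used, the inference script divides the image list with the strided slice image_file_list[args.process_id::args.total_process_num]. The sketch below shows what that slice does on a hypothetical list of file names; the helper name shard is an assumption for illustration.

```python
def shard(items, process_id, total_process_num):
    """Return the items handled by one worker, using a strided slice."""
    return items[process_id::total_process_num]


files = ["img_%d.jpg" % i for i in range(7)]  # hypothetical file names
shards = [shard(files, pid, 3) for pid in range(3)]
for pid, part in enumerate(shards):
    print(pid, part)
# 0 ['img_0.jpg', 'img_3.jpg', 'img_6.jpg']
# 1 ['img_1.jpg', 'img_4.jpg']
# 2 ['img_2.jpg', 'img_5.jpg']
```

Every file lands in exactly one shard, and the shard sizes differ by at most one, so no coordination between the processes is needed.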


Take the image file ./doc/imgs/00018069.jpg as an example. The original image is as follows.


Figure 3: The original image


If you want to use the direction classifier in end-to-end inference, run the following command.


# Use the direction classifier to run the PP-OCRv2 system
!python tools/infer/predict_system.py \
    --image_dir="./doc/imgs/00018069.jpg" \
    --det_model_dir="./inference/ch_PP-OCRv2_det_infer/" \
    --cls_model_dir="./inference/ch_ppocr_mobile_v2.0_cls_infer/" \
    --rec_model_dir="./inference/ch_PP-OCRv2_rec_infer/" \
    --use_angle_cls=True

# Visualize
img = cv2.imread("./inference_results/00018069.jpg")
plt.figure(figsize=(20, 8))
plt.imshow(img[:, :, ::-1])
plt.show()


By default, the visualization result is saved in the ./inference_results folder, as shown below.


The detection boxes and the recognition results are drawn on the image, and the recognized text and the path of the
file that was read are also printed in the notebook output above.


If you want to save the cropped text regions, set the save_crop_res parameter to True; the crops are then saved in
the output directory. Some of the cropped images are shown below. The saved crops can later be used for labeling and
for training recognition models.


# Crop the text detection results and save them
!python tools/infer/predict_system.py \
    --image_dir="./doc/imgs/00018069.jpg" \
    --det_model_dir="./inference/ch_PP-OCRv2_det_infer/" \
    --cls_model_dir="./inference/ch_ppocr_mobile_v2.0_cls_infer/" \
    --rec_model_dir="./inference/ch_PP-OCRv2_rec_infer/" \
    --use_angle_cls=True \
    --save_crop_res=True


# Visualize
plt.figure(figsize=(8, 8))
plt.imshow(cv2.imread("./doc/imgs/00018069.jpg")[:, :, ::-1])
plt.show()

plt.figure(figsize=(14, 4))
plt.subplot(1, 3, 1)
plt.imshow(cv2.imread("output/mg_crop_0.jpg")[:, :, ::-1])
plt.subplot(1, 3, 2)
plt.imshow(cv2.imread("output/mg_crop_1.jpg")[:, :, ::-1])
plt.subplot(1, 3, 3)
plt.imshow(cv2.imread("output/mg_crop_10.jpg")[:, :, ::-1])
plt.show()


The original image is as follows.


The end-to-end inference is implemented by TextSystem. Its definition and inference flow are as follows.


# Reference Code: https://github.com/PaddlePaddle/PaddleOCR/blob/release/2.4/tools/infer/predict_system.py
import logging
import os

from tools.infer.utility import draw_ocr_box_txt, get_rotate_crop_image
from ppocr.utils.utility import get_image_file_list


class TextSystem(object):
    # Initialize the function
    def __init__(self, args):
        self.args = args
        # Running index used to name the saved crops
        self.crop_image_res_index = 0
        # If you do not want to display log, you can set show_log to False
        if not args.show_log:
            logger.setLevel(logging.INFO)
        # Define the text detection model inference engine
        self.text_detector = TextDetector(args)
        # Define the text recognition model inference engine
        self.text_recognizer = TextRecognizer(args)
        # Use the direction classifier or not
        self.use_angle_cls = args.use_angle_cls
        # Score threshold that determines whether a detection and recognition
        # result is visualized or returned
        self.drop_score = args.drop_score
        # Define the direction classifier inference engine
        if self.use_angle_cls:
            self.text_classifier = TextClassifier(args)

    # Save the cropped images of the text detection results
    def draw_crop_rec_res(self, output_dir, img_crop_list, rec_res):
        os.makedirs(output_dir, exist_ok=True)
        bbox_num = len(img_crop_list)
        for bno in range(bbox_num):
            cv2.imwrite(
                os.path.join(output_dir,
                             f"mg_crop_{bno+self.crop_image_res_index}.jpg"),
                img_crop_list[bno])
            logger.debug(f"{bno}, {rec_res[bno]}")
        self.crop_image_res_index += bbox_num

    # Core inference function
    def __call__(self, img, cls=True):
        ori_im = img.copy()
        # Get the text detection result
        dt_boxes, elapse = self.text_detector(img)
        # Check for a missing result before its length is logged
        if dt_boxes is None:
            return None, None
        logger.debug("dt_boxes num : {}, elapse : {}".format(
            len(dt_boxes), elapse))
        img_crop_list = []
        # Sort the detection boxes from top to bottom, then from left to right
        dt_boxes = sorted_boxes(dt_boxes)
        # Perform perspective transformation and correction on the detection result
        for bno in range(len(dt_boxes)):
            tmp_box = copy.deepcopy(dt_boxes[bno])
            img_crop = get_rotate_crop_image(ori_im, tmp_box)
            img_crop_list.append(img_crop)
        # Use the direction classifier to correct the detection result
        if self.use_angle_cls and cls:
            img_crop_list, angle_list, elapse = self.text_classifier(
                img_crop_list)
            logger.debug("cls num : {}, elapse : {}".format(
                len(img_crop_list), elapse))
        # Get the text recognition result
        rec_res, elapse = self.text_recognizer(img_crop_list)
        logger.debug("rec_res num : {}, elapse : {}".format(
            len(rec_res), elapse))
        # Save the corrected crops of the text detection results
        if self.args.save_crop_res:
            self.draw_crop_rec_res(self.args.crop_res_save_dir, img_crop_list,
                                   rec_res)
        filter_boxes, filter_rec_res = [], []
        # Filter the results by recognition score: results whose score is
        # below the threshold are dropped
        for box, rec_result in zip(dt_boxes, rec_res):
            text, score = rec_result
            if score >= self.drop_score:
                filter_boxes.append(box)
                filter_rec_res.append(rec_result)
        return filter_boxes, filter_rec_res


def sorted_boxes(dt_boxes):
    # Sort the detection boxes: first from top to bottom, then from left to right
    num_boxes = dt_boxes.shape[0]
    sorted_boxes = sorted(dt_boxes, key=lambda x: (x[0][1], x[0][0]))
    _boxes = list(sorted_boxes)

    for i in range(num_boxes - 1):
        if abs(_boxes[i + 1][0][1] - _boxes[i][0][1]) < 10 and \
                (_boxes[i + 1][0][0] < _boxes[i][0][0]):
            tmp = _boxes[i]
            _boxes[i] = _boxes[i + 1]
            _boxes[i + 1] = tmp
    return _boxes


from PIL import Image

args = parse_args()
args.cls_model_dir = "./inference/ch_ppocr_mobile_v2.0_cls_infer"
args.det_model_dir = "./inference/ch_PP-OCRv2_det_infer/"
args.rec_model_dir = "./inference/ch_PP-OCRv2_rec_infer/"
args.image_dir = "./doc/imgs/00018069.jpg"
args.use_angle_cls = True
args.use_gpu = True

image_file_list = get_image_file_list(args.image_dir)
image_file_list = image_file_list[args.process_id::args.total_process_num]
text_sys = TextSystem(args)
is_visualize = True
font_path = args.vis_font_path
drop_score = args.drop_score

total_time = 0
cpu_mem, gpu_mem, gpu_util = 0, 0, 0
_st = time.time()
count = 0
for idx, image_file in enumerate(image_file_list):
    img = cv2.imread(image_file)
    if img is None:
        logger.debug("error in loading image: {}".format(image_file))
        continue
    starttime = time.time()
    dt_boxes, rec_res = text_sys(img)
    elapse = time.time() - starttime
    total_time += elapse

    logger.debug(
        str(idx) + "  Predict time of %s: %.3fs" % (image_file, elapse))
    for text, score in rec_res:
        logger.debug("{}, {:.3f}".format(text, score))

    if is_visualize:
        image = Image.fromarray(cv2.cvtColor(img, cv2.COLOR_BGR2RGB))
        boxes = dt_boxes
        txts = [rec_res[i][0] for i in range(len(rec_res))]
        scores = [rec_res[i][1] for i in range(len(rec_res))]

        draw_img = draw_ocr_box_txt(
            image,
            boxes,
            txts,
            scores,
            drop_score=drop_score,
            font_path=font_path)
        draw_img_save_dir = args.draw_img_save_dir
        os.makedirs(draw_img_save_dir, exist_ok=True)
        cv2.imwrite(
            os.path.join(draw_img_save_dir, os.path.basename(image_file)),
            draw_img[:, :, ::-1])
        logger.debug("The visualized image saved in {}".format(
            os.path.join(draw_img_save_dir, os.path.basename(image_file))))

logger.info("The predict total time is {}".format(time.time() - _st))

plt.figure(figsize=(8, 8))
plt.imshow(image)
plt.show()

plt.figure(figsize=(16, 8))
plt.imshow(draw_img)
plt.show()
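
The two list operations that TextSystem performs around recognition, ordering the boxes and filtering by score, can be tried in isolation. The following is a minimal sketch on plain Python lists rather than NumPy arrays; the helper names sort_boxes, filter_by_score, and make_box, the sample boxes, and the threshold value 0.5 are assumptions for illustration.

```python
def sort_boxes(boxes, y_tol=10):
    """Order quadrilateral boxes top to bottom, then left to right.

    Each box is a list of four [x, y] points starting at the top-left corner.
    Neighbours whose top edges differ by less than y_tol pixels are treated
    as lying on the same text line and ordered by x.
    """
    ordered = sorted(boxes, key=lambda b: (b[0][1], b[0][0]))
    for i in range(len(ordered) - 1):
        same_row = abs(ordered[i + 1][0][1] - ordered[i][0][1]) < y_tol
        if same_row and ordered[i + 1][0][0] < ordered[i][0][0]:
            ordered[i], ordered[i + 1] = ordered[i + 1], ordered[i]
    return ordered


def filter_by_score(boxes, rec_res, drop_score):
    """Keep only the boxes whose recognition score reaches drop_score."""
    kept = [(b, r) for b, r in zip(boxes, rec_res) if r[1] >= drop_score]
    return [b for b, _ in kept], [r for _, r in kept]


def make_box(x, y, w=50, h=20):
    return [[x, y], [x + w, y], [x + w, y + h], [x, y + h]]


# Two boxes on the first text line (y = 96 and y = 100), one on the next line.
boxes = [make_box(200, 96), make_box(10, 100), make_box(10, 140)]
ordered = sort_boxes(boxes)
print([b[0] for b in ordered])  # [[10, 100], [200, 96], [10, 140]]

rec_res = [("left", 0.9), ("right", 0.3), ("below", 0.6)]  # made-up results
kept_boxes, kept_res = filter_by_score(ordered, rec_res, drop_score=0.5)
print(kept_res)  # [('left', 0.9), ('below', 0.6)]
```

A plain sort on the y coordinate alone would put the box at y = 96 first even though it sits to the right on the same line; the neighbour swap corrects this for boxes that are slightly misaligned vertically.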

Running the end-to-end script above displays the original image followed by the visualized detection and recognition result.


7.2.6 Inference Using the WHL Package in PP-OCRv2


To make it easier to get started with the OCR text detection and recognition models, PaddleOCR provides a whl package
built on the Paddle Inference engine. After installing it with a single command, you can try PaddleOCR right away.


Installing whl Package 


Use pip to install the whl package of PaddleOCR. The command is as follows.

!pip install "paddleocr==2.3.0.2"

# If you want the latest features, you can build and install the package from the source code
# python3 setup.py bdist_wheel
# pip3 install dist/paddleocr-x.x.x-py3-none-any.whl  # x.x.x is the version number of paddleocr


Using whl Package for Inference 


By default, the PaddleOCR whl package automatically downloads the PP-OCRv2 ultra-lightweight models. It also supports
custom model paths, inference configurations, and other parameters; the parameter names are the same as those used
for Python inference with Paddle Inference.


• Run text detection separately


Run the following code to quickly experience the inference of the text detection model. 


from paddleocr import PaddleOCR, draw_ocr

ocr = PaddleOCR(use_gpu=False)  # need to run only once to download and load model into memory
img_path = '/home/aistudio/PaddleOCR/doc/imgs/11.jpg'
result = ocr.ocr(img_path, rec=False)
for line in result:
    print(line)

# Display the result
from PIL import Image

image = Image.open(img_path).convert('RGB')
im_show = draw_ocr(image, result, txts=None, scores=None, font_path='/home/aistudio/PaddleOCR/doc/fonts/simfang.ttf')
plt.figure(figsize=(15, 8))
plt.imshow(im_show)
plt.show()


The output is as follows. 


[[27.0, 459.0], [135.0, 459.0], [135.0, 479.0], [27.0, 479.0]] 
[[29.0, 431.0], [369.0, 431.0], [369.0, 444.0], [29.0, 444.0]] 
[[26.0, 397.0], [361.0, 397.0], [361.0, 414.0], [26.0, 414.0]]


• Run text recognition separately


Set det=False to run only the recognition module.

from paddleocr import PaddleOCR

ocr = PaddleOCR(use_gpu=False)  # need to run only once to download and load model into memory
img_path = '/home/aistudio/PaddleOCR/doc/imgs_words/ch/word_1.jpg'
result = ocr.ocr(img_path, det=False)
for line in result:
    print(line)


# expected output: ('MAAy', 0.9967349) 


• Run the direction classifier separately


Set det=False, rec=False, and cls=True to run only the direction classifier.

from paddleocr import PaddleOCR

ocr = PaddleOCR(use_angle_cls=True, use_gpu=False)  # need to run only once to download and load model into memory
img_path = '/home/aistudio/PaddleOCR/doc/imgs_words/ch/word_1.jpg'
result = ocr.ocr(img_path, det=False, rec=False, cls=True)
for line in result:
    print(line)

img = cv2.imread(img_path)
plt.imshow(img[:, :, ::-1])
plt.show()


# expected output: ['0', 0.9998784] 


• Run the whole pipeline: detection + direction classifier + recognition


from paddleocr import PaddleOCR, draw_ocr
import matplotlib.pyplot as plt
%matplotlib inline

# PaddleOCR currently supports many languages, including Chinese, English, French,
# German, Korean, and Japanese. You can switch the language by modifying the lang parameter.
# The corresponding values are 'ch', 'en', 'french', 'german', 'korean', 'japan'.
ocr = PaddleOCR(use_angle_cls=True, lang="ch", use_gpu=False)  # need to run only once to download and load model into memory
img_path = '/home/aistudio/PaddleOCR/doc/imgs/11.jpg'
result = ocr.ocr(img_path, cls=True)
for line in result:
    print(line)

# Display the result
from PIL import Image

image = Image.open(img_path).convert('RGB')
boxes = [line[0] for line in result]
txts = [line[1][0] for line in result]
scores = [line[1][1] for line in result]
im_show = draw_ocr(image, boxes, txts, scores, font_path='/home/aistudio/PaddleOCR/doc/fonts/simfang.ttf')
plt.figure(figsize=(15, 8))
plt.imshow(im_show)
plt.show()


pre 


Expected output: 


PMIZS 07 ATS S0i, T3305 0; 11S 0), WS80.0,7 13930), [2810, 132 10) fi, \( We 45 n/eiaad OO RAIA 2, 
e 0.90023524) J 

ZO; WAS Cy. (eG Ohs her Oi Ieee 0 Ot Oi, TAS WES sO, (TAREE OOO mIRIE;S. Ole 
49793598) | 


Tera, 


The result is a list, within which each item contains a text box, a text and recognition confidence. 
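
As a minimal sketch of how such a result list might be consumed, the snippet below keeps only confident lines and converts each quadrilateral text box into an axis-aligned rectangle. It assumes the item layout described above, [box, (text, score)]; the box coordinates are taken from the detection output shown earlier, while the texts, scores, the 0.5 threshold, and the helper names are made up for illustration.

```python
# Hypothetical result in the layout described above: [box, (text, score)].
result = [
    [[[27.0, 459.0], [135.0, 459.0], [135.0, 479.0], [27.0, 479.0]], ("line A", 0.98)],
    [[[29.0, 431.0], [369.0, 431.0], [369.0, 444.0], [29.0, 444.0]], ("line B", 0.42)],
]


def bounding_rect(box):
    """Axis-aligned rectangle (x_min, y_min, x_max, y_max) around a 4-point box."""
    xs = [p[0] for p in box]
    ys = [p[1] for p in box]
    return min(xs), min(ys), max(xs), max(ys)


def confident_lines(result, min_score=0.5):
    """Return one dict per item whose recognition confidence reaches min_score."""
    lines = []
    for box, (text, score) in result:
        if score >= min_score:
            lines.append({"text": text, "score": score, "rect": bounding_rect(box)})
    return lines


lines = confident_lines(result)
for item in lines:
    print(item)
```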
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7.3 C++ Inference Based on Paddle Inference 


In inference deployment, C++ generally performs better than Python. Therefore, C++ is chosen as the development
language for inference in many scenarios.


The Paddle Inference engine introduced in the previous section also supports C++ inference. This section mainly
covers C++ inference with PP-OCRv2.


In the C++ inference on the PP-OCRv2 system based on Paddle Inference, there are several steps: 
(1) Preparing the model 

(2) Compiling the opencv library 

(3) Getting Paddle Inference prediction library 

(4) Compiling PaddleOCR C++ inference code 

(5) Running the PP-OCRv2 system 


Because of version limitations on AI Studio, the process is only explained here and not demonstrated. It is
recommended that you try the C++ inference process of PP-OCRv2 locally.


For more details about this section, please refer to: PP-OCRv2 C++ Inference Tutorial. 


7.3.1 Preparing the Model


Use the following command to prepare the inference model of PP-OCRv2. 


cd deploy/cpp_infer

wget https://paddleocr.bj.bcebos.com/PP-OCRv2/chinese/ch_PP-OCRv2_det_infer.tar -O ch_PP-OCRv2_det_infer.tar && tar -xf ch_PP-OCRv2_det_infer.tar

wget https://paddleocr.bj.bcebos.com/PP-OCRv2/chinese/ch_PP-OCRv2_rec_infer.tar -O ch_PP-OCRv2_rec_infer.tar && tar -xf ch_PP-OCRv2_rec_infer.tar

wget https://paddleocr.bj.bcebos.com/dygraph_v2.0/ch/ch_ppocr_mobile_v2.0_cls_infer.tar -O ch_ppocr_mobile_v2.0_cls_infer.tar && tar -xf ch_ppocr_mobile_v2.0_cls_infer.tar


7.3.2 Compiling OpenCV Library 


• First, download the OpenCV source package from the OpenCV official website for compilation in a Linux environment.
Taking OpenCV 3.4.7 as an example, the download commands are as follows.


wget https://paddleocr.bj.bcebos.com/libs/opencv/opencv-3.4.7.tar.gz
tar -xf opencv-3.4.7.tar.gz


After extraction, the folder opencv-3.4.7/ appears in the current directory.


• Compile OpenCV. Set root_path and install_path, enter the directory of the OpenCV source code, and compile it
as follows.


root_path="your_opencv_root_path"
install_path=${root_path}/opencv3
build_dir=${root_path}/build

rm -rf ${build_dir}
mkdir ${build_dir}
cd ${build_dir}

cmake .. \
    -DCMAKE_INSTALL_PREFIX=${install_path} \
    -DCMAKE_BUILD_TYPE=Release \
    -DBUILD_SHARED_LIBS=OFF \
    -DWITH_IPP=OFF \
    -DBUILD_IPP_IW=OFF \
    -DWITH_LAPACK=OFF \
    -DWITH_EIGEN=OFF \
    -DCMAKE_INSTALL_LIBDIR=lib64 \
    -DWITH_ZLIB=ON \
    -DBUILD_ZLIB=ON \
    -DWITH_JPEG=ON \
    -DBUILD_JPEG=ON \
    -DWITH_PNG=ON \
    -DBUILD_PNG=ON \
    -DWITH_TIFF=ON \
    -DBUILD_TIFF=ON

make -j
make install


You can also modify the contents of tools/build_opencv.sh and then run the following command to compile.


sh tools/build_opencv.sh


Here root_path is the path of the downloaded OpenCV source code, and install_path is the installation path of OpenCV.
After make install finishes, the OpenCV header files and library files are generated in this folder and are used
later to compile the OCR code.


The file structure in the installation path is shown below. 


opencv3/
|-- bin
|-- include
|-- lib
|-- lib64
|-- share


7.3.3 Downloading the Inference Library of Paddle Inference 


• Linux inference libraries for different CUDA versions are available on the official Paddle Inference library
website. Choose the version that matches your environment.


• After downloading, extract the library as follows.


wget https://paddle-inference-lib.bj.bcebos.com/2.2.1/cxx_c/Linux/GPU/x86-64_gcc8.2_avx_mkl_cuda10.2_cudnn8.1.1_trt7.2.3.4/paddle_inference.tgz -O paddle_inference.tgz
tar -xf paddle_inference.tgz


A subfolder named paddle_inference/ is generated in the current folder.




7.3.4 Compiling the Inference Code of PaddleOCR 


The compilation command is as follows. The paths of the Paddle C++ inference library, OpenCV, and the other
dependent libraries must be replaced with the actual paths on your machine.


sh tools/build.sh


• Specifically, modify the environment paths in tools/build.sh:


OPENCV_DIR=your_opencv_dir 
LIB_DIR=your_paddle_inference_dir 
CUDA_LIB_DIR=your_cuda_lib_dir 
CUDNN_LIB_DIR=/your_cudnn_lib_dir 


OPENCV_DIR is the path where OpenCV was compiled and installed; LIB_DIR is the path of the Paddle inference
library, either downloaded (the paddle_inference folder) or compiled (the build/paddle_inference_install_dir
folder); CUDA_LIB_DIR is the path of the CUDA library files, which is /usr/local/cuda/lib64 in the docker image;
CUDNN_LIB_DIR is the path of the cuDNN library files, which is /usr/lib/x86_64-linux-gnu/ in the docker image.
Note: all of the above paths must be absolute paths, not relative paths.


• After the compilation, an executable file named ppocr is generated in the build folder.
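
Since tools/build.sh expects absolute paths, it can help to check the values before editing the script. The following is a small sketch that uses only the Python standard library; the directory values and the helper name check_build_paths are hypothetical and should be replaced with the locations on your own machine.

```python
import os


def check_build_paths(paths):
    """Return the names of the variables whose value is not an absolute path."""
    return [name for name, value in paths.items() if not os.path.isabs(value)]


paths = {
    "OPENCV_DIR": "/opt/opencv3",              # hypothetical location
    "LIB_DIR": "./paddle_inference",           # relative path: reported below
    "CUDA_LIB_DIR": "/usr/local/cuda/lib64",
    "CUDNN_LIB_DIR": "/usr/lib/x86_64-linux-gnu/",
}

print(check_build_paths(paths))            # ['LIB_DIR']
# os.path.abspath resolves a relative path against the current directory.
print(os.path.abspath(paths["LIB_DIR"]))
```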


7.3.5 Running the PP-OCRv2 System 


Running mode:


./build/ppocr <mode> [--param1] [--param2] [...]


mode is a required parameter that selects the function to run. Its value must be one of ['det', 'rec', 'system'],
which call detection, recognition, and end-to-end detection and recognition (including the direction classifier)
respectively. The commands are as follows:


• Run the text detection model only


./build/ppocr det \
    --det_model_dir=./ch_PP-OCRv2_det_infer/ \
    --image_dir=../../doc/imgs/12.jpg


• Run the text recognition model only


./build/ppocr rec \
    --rec_model_dir=./ch_PP-OCRv2_rec_infer/ \
    --image_dir=../../doc/imgs_words/ch/


• Run the PP-OCRv2 system


# Do not use the direction classifier
./build/ppocr system \
    --det_model_dir=./ch_PP-OCRv2_det_infer/ \
    --rec_model_dir=./ch_PP-OCRv2_rec_infer/ \
    --image_dir=../../doc/imgs/12.jpg


# Use the direction classifier
./build/ppocr system \
    --det_model_dir=./ch_PP-OCRv2_det_infer/ \
    --rec_model_dir=./ch_PP-OCRv2_rec_infer/ \
    --use_angle_cls=true \
    --cls_model_dir=./ch_ppocr_mobile_v2.0_cls_infer \
    --image_dir=../../doc/imgs/12.jpg
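
If the compiled binary is driven from a script, the command line can be assembled programmatically. The sketch below only builds and prints the argument list with the mode values and flags shown above; the helper name build_ppocr_command is hypothetical, and nothing is executed.

```python
import shlex

VALID_MODES = ("det", "rec", "system")


def build_ppocr_command(mode, binary="./build/ppocr", **params):
    """Build the argument list for the ppocr executable."""
    if mode not in VALID_MODES:
        raise ValueError("mode must be one of %s, got %r" % (VALID_MODES, mode))
    cmd = [binary, mode]
    for key, value in params.items():
        if isinstance(value, bool):
            value = "true" if value else "false"
        cmd.append("--%s=%s" % (key, value))
    return cmd


cmd = build_ppocr_command(
    "system",
    det_model_dir="./ch_PP-OCRv2_det_infer/",
    rec_model_dir="./ch_PP-OCRv2_rec_infer/",
    use_angle_cls=True,
    cls_model_dir="./ch_ppocr_mobile_v2.0_cls_infer",
    image_dir="../../doc/imgs/12.jpg",
)
print(" ".join(shlex.quote(part) for part in cmd))
# Once ./build/ppocr has been compiled, the list can be passed to
# subprocess.run(cmd, check=True).
```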


7.4 Service Deployment using Paddle Serving 


Sections 7.2 and 7.3 introduced inference for the PP-OCRv2 system based on Paddle Inference. This is offline
inference: code deployed on a specific machine can be used only on that machine and cannot be accessed from other
machines. This limitation creates the need for deploying models as services.


In service deployment, the model is deployed as a service, and other devices access the service by sending a request
to obtain the inference result of the model. A schematic diagram is shown below.


Figure 4: Schematic diagram of service deployment


After the model is deployed, different users can get the inference service by sending requests as clients. 
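
The request and response pattern can be illustrated with a toy service built only from the Python standard library. This is a conceptual sketch, not the Paddle Serving API: the function fake_ocr_model returns a fixed, made-up prediction in place of a real model, and the /predict path and JSON fields are assumptions.

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


def fake_ocr_model(image_name):
    # Placeholder for a real model: returns a fixed, made-up prediction.
    return {"image": image_name, "text": "hello", "score": 0.99}


class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        request = json.loads(self.rfile.read(length))
        body = json.dumps(fake_ocr_model(request["image"])).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # keep the demo output quiet
        pass


def request_prediction(port, image_name):
    data = json.dumps({"image": image_name}).encode("utf-8")
    req = urllib.request.Request(
        "http://127.0.0.1:%d/predict" % port, data=data,
        headers={"Content-Type": "application/json"})
    # Bypass any configured HTTP proxy for this local request.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    with opener.open(req, timeout=5) as resp:
        return json.loads(resp.read())


# Server side: port 0 lets the operating system pick a free port.
server = HTTPServer(("127.0.0.1", 0), PredictHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

# Client side: any machine that can reach the server could send this request.
reply = request_prediction(server.server_port, "demo.jpg")
print(reply)

server.shutdown()
server.server_close()
```

The client needs neither the model files nor the inference library; it only has to know the service address and the request format.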


Paddle Serving is a tool created by PaddlePaddle to help developers perform service deployment. This section mainly
covers the service deployment of the PP-OCRv2 system with Paddle Serving.


7.4.1 Introduction to Paddle Serving


Paddle Serving is the open-source service deployment framework of PaddlePaddle. Its long-term goal is to provide
increasingly professional, reliable, and easy-to-use services that cover the last mile of putting AI into production.
Paddle Serving currently provides two frameworks: C++ Serving and Python Pipeline. The Python Pipeline framework
is geared toward ease of secondary development, while the C++ Serving framework is geared toward high performance.


When the PP-OCRv2 model is deployed as a service with Paddle Serving, the process is as follows. 


Figure 5: Flow chart of PP-OCRv2 system deployment based on Paddle Serving 


7.4.2 Preparing Inference Data and Deployment Environment


The data should be consistent with the data used for model inference.


Before running Paddle Serving, three packages of Paddle Serving need to be installed: paddle-serving-server, paddle- 
serving-client and paddle-serving-app. The commands are as follows. 


!wget https://paddle-serving.bj.bcebos.com/test-dev/whl/paddle_serving_server_gpu-0.7.0.post102-py3-none-any.whl
!pip install paddle_serving_server_gpu-0.7.0.post102-py3-none-any.whl

!wget https://paddle-serving.bj.bcebos.com/test-dev/whl/paddle_serving_client-0.7.0-cp37-none-any.whl
!pip install paddle_serving_client-0.7.0-cp37-none-any.whl

!wget https://paddle-serving.bj.bcebos.com/test-dev/whl/paddle_serving_app-0.7.0-py3-none-any.whl
!pip install paddle_serving_app-0.7.0-py3-none-any.whl
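
After installation, it can be useful to confirm that the three packages are importable before going further. The following is a minimal sketch, assuming the wheels expose the module names paddle_serving_server, paddle_serving_client and paddle_serving_app; the helper name missing_modules is hypothetical and not part of Paddle Serving.

```python
import importlib.util

# Module names assumed to be provided by the three wheels installed above.
SERVING_MODULES = [
    "paddle_serving_server",
    "paddle_serving_client",
    "paddle_serving_app",
]


def missing_modules(names):
    """Return the names that cannot be located on the current Python path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(SERVING_MODULES)
    if missing:
        print(f"Missing packages: {missing}")
    else:
        print("All Paddle Serving packages were found")
```

The check only locates the modules and does not import them, so it stays fast and has no side effects.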


7.4.3 Preparing for model deployment


Before deploying the model service, first convert the inference model into a model suitable for service deployment.


First run the following commands to download the inference models.

import os
os.chdir("/home/aistudio/PaddleOCR/deploy/pdserving/")

# Download and unzip the OCR text detection model
!wget https://paddleocr.bj.bcebos.com/PP-OCRv2/chinese/ch_PP-OCRv2_det_infer.tar -O ch_PP-OCRv2_det_infer.tar && tar -xf ch_PP-OCRv2_det_infer.tar && rm ch_PP-OCRv2_det_infer.tar

# Download and unzip the OCR text recognition model
!wget https://paddleocr.bj.bcebos.com/PP-OCRv2/chinese/ch_PP-OCRv2_rec_infer.tar -O ch_PP-OCRv2_rec_infer.tar && tar -xf ch_PP-OCRv2_rec_infer.tar && rm ch_PP-OCRv2_rec_infer.tar


--2021-12-25 16:25:32--  https://paddleocr.bj.bcebos.com/PP-OCRv2/chinese/ch_PP-OCRv2_det_infer.tar
Resolving paddleocr.bj.bcebos.com (paddleocr.bj.bcebos.com)... 100.67.200.6
Connecting to paddleocr.bj.bcebos.com (paddleocr.bj.bcebos.com)|100.67.200.6|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3190272 (3.0M) [application/x-tar]
Saving to: ‘ch_PP-OCRv2_det_infer.tar’

ch_PP-OCRv2_det_inf 100%[===================>]   3.04M  --.-KB/s    in 0.09s

2021-12-25 16:25:32 (35.0 MB/s) - ‘ch_PP-OCRv2_det_infer.tar’ saved [3190272/3190272]

--2021-12-25 16:25:33--  https://paddleocr.bj.bcebos.com/PP-OCRv2/chinese/ch_PP-OCRv2_rec_infer.tar
Resolving paddleocr.bj.bcebos.com (paddleocr.bj.bcebos.com)... 100.67.200.6
Connecting to paddleocr.bj.bcebos.com (paddleocr.bj.bcebos.com)|100.67.200.6|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 8875520 (8.5M) [application/x-tar]
Saving to: ‘ch_PP-OCRv2_rec_infer.tar’

ch_PP-OCRv2_rec_inf 100%[===================>]   8.46M  --.-KB/s    in 0.1s

2021-12-25 16:25:33 (69.0 MB/s) - ‘ch_PP-OCRv2_rec_infer.tar’ saved [8875520/8875520]


Run the command to convert the model. 


# Convert the detection model
!python -m paddle_serving_client.convert --dirname ./ch_PP-OCRv2_det_infer/ \
                                         --model_filename inference.pdmodel \
                                         --params_filename inference.pdiparams \
                                         --serving_server ./ppocrv2_det_serving/ \
                                         --serving_client ./ppocrv2_det_client/

# Convert the recognition model
!python -m paddle_serving_client.convert --dirname ./ch_PP-OCRv2_rec_infer/ \
                                         --model_filename inference.pdmodel \
                                         --params_filename inference.pdiparams \
                                         --serving_server ./ppocrv2_rec_serving/ \
                                         --serving_client ./ppocrv2_rec_client/
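
When more than two models have to be converted, the two commands above can be generated from a small script instead of being typed by hand. The sketch below only assembles the same command line shown above (the flags are taken from those commands) and passes it to subprocess; the function names convert_command and convert_all are hypothetical helpers, not part of Paddle Serving.

```python
import subprocess


def convert_command(infer_dir, server_dir, client_dir):
    """Build the paddle_serving_client.convert command line for one model."""
    return [
        "python", "-m", "paddle_serving_client.convert",
        "--dirname", infer_dir,
        "--model_filename", "inference.pdmodel",
        "--params_filename", "inference.pdiparams",
        "--serving_server", server_dir,
        "--serving_client", client_dir,
    ]


def convert_all(run=subprocess.run):
    """Convert the detection and recognition models with the folder names used above."""
    for kind in ("det", "rec"):
        cmd = convert_command(
            f"./ch_PP-OCRv2_{kind}_infer/",
            f"./ppocrv2_{kind}_serving/",
            f"./ppocrv2_{kind}_client/",
        )
        # check=True raises an exception if the conversion exits with an error.
        run(cmd, check=True)
```

Passing the runner as a parameter makes it possible to print or log the commands first, for example with convert_all(run=lambda cmd, check: print(" ".join(cmd))).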


# View the folder
!tree -h *_client *_serving

ppocrv2_det_client
├── [ 296]  serving_client_conf.prototxt
└── [  98]  serving_client_conf.stream.prototxt
ppocrv2_rec_client
├── [ 284]  serving_client_conf.prototxt
└── [  93]  serving_client_conf.stream.prototxt
ppocrv2_det_serving
├── [ 42M]  inference.pdiparams
├── [842K]  inference.pdmodel
├── [ 296]  serving_server_conf.prototxt
└── [  98]  serving_server_conf.stream.prototxt
ppocrv2_rec_serving
├── inference.pdiparams
├── inference.pdmodel
├── [ 284]  serving_server_conf.prototxt
└── [  93]  serving_server_conf.stream.prototxt

0 directories, 12 files


After the detection model is converted, two additional folders, ppocrv2_det_serving and ppocrv2_det_client, appear in
the current folder, with the following layout:

|- ppocrv2_det_serving/
    |- inference.pdiparams
    |- inference.pdmodel
    |- serving_server_conf.prototxt
    |- serving_server_conf.stream.prototxt

|- ppocrv2_det_client/
    |- serving_client_conf.prototxt
    |- serving_client_conf.stream.prototxt

The recognition model is converted into the same layout.


7.4.4 Deploying Paddle Serving pipeline 


Note: Modify the two model_config fields in the PaddleOCR/deploy/pdserving/config.yml file to ppocrv2_det_serving and
ppocrv2_rec_serving respectively, so that they match the folders produced by the model conversion.


The pdserving directory contains the code to start the pipeline service and send prediction requests, including:

__init__.py
config.yml                # Configuration file for starting the service
ocr_reader.py             # Code for OCR model pre-processing and post-processing
pipeline_http_client.py   # Script for sending the inference request
web_service.py            # Script for starting the server


Starting the service


Open a new terminal and run the following commands to start the service:

# Start the service and save the running log in web_serving_log.txt
cd PaddleOCR/deploy/pdserving/
nohup python web_service.py &>web_serving_log.txt &


After the service starts, a log similar to the following is printed in web_serving_log.txt:


--- Running analysis [inference_op_replace_pass]
--- Running analysis [memory_optimize_pass]
I0308 09:47:57.704764 65137 memory_optimize_pass.cc:200] Cluster name : conv2d_89.tmp_0  size: 153600
I0308 09:47:57.704782 65137 memory_optimize_pass.cc:200] Cluster name : elementwise_add_7  size: 358400
I0308 09:47:57.704787 65137 memory_optimize_pass.cc:200] Cluster name : conv2d_90.tmp_0  size: 614400
I0308 09:47:57.704792 65137 memory_optimize_pass.cc:200] Cluster name : batch_norm_48.tmp_2  size: 9830400
I0308 09:47:57.704795 65137 memory_optimize_pass.cc:200] Cluster name : relu_2.tmp_0  size: 13107200
I0308 09:47:57.704799 65137 memory_optimize_pass.cc:200] Cluster name : conv2d_96.tmp_0  size: 2457600
I0308 09:47:57.704803 65137 memory_optimize_pass.cc:200] Cluster name : conv2d_92.tmp_0  size: 9830400
I0308 09:47:57.704807 65137 memory_optimize_pass.cc:200] Cluster name : tmp_1  size: 2457600
I0308 09:47:57.704811 65137 memory_optimize_pass.cc:200] Cluster name : x  size: 4915200
--- Running analysis [ir_graph_to_program_pass]
I0308 09:47:57.706780 65160 memory_optimize_pass.cc:200] Cluster name : conv2d_89.tmp_0  size: 153600
I0308 09:47:57.706801 65160 memory_optimize_pass.cc:200] Cluster name : elementwise_add_7  size: 358400
I0308 09:47:57.706807 65160 memory_optimize_pass.cc:200] Cluster name : conv2d_90.tmp_0  size: 614400
I0308 09:47:57.706813 65160 memory_optimize_pass.cc:200] Cluster name : batch_norm_48.tmp_2  size: 9830400
I0308 09:47:57.706817 65160 memory_optimize_pass.cc:200] Cluster name : relu_2.tmp_0  size: 13107200
I0308 09:47:57.706825 65160 memory_optimize_pass.cc:200] Cluster name : conv2d_96.tmp_0  size: 2457600
I0308 09:47:57.706831 65160 memory_optimize_pass.cc:200] Cluster name : conv2d_92.tmp_0  size: 9830400
I0308 09:47:57.706836 65160 memory_optimize_pass.cc:200] Cluster name : tmp_1  size: 2457600
I0308 09:47:57.706841 65160 memory_optimize_pass.cc:200] Cluster name : x  size: 4915200
I0308 09:47:57.708473 65173 analysis_predictor.cc:548] ======= optimize end =======
I0308 09:47:57.708534 65173 naive_executor.cc:107] ---  skip [feed], feed -> x
--- Running analysis [ir_graph_to_program_pass]
I0308 09:47:57.710592 65173 naive_executor.cc:107] ---  skip [save_infer_model/scale_0.tmp_0], fetch -> fetch
I0308 09:47:57.743366 65137 analysis_predictor.cc:548] ======= optimize end =======
I0308 09:47:57.743443 65137 naive_executor.cc:107] ---  skip [feed], feed -> x
I0308 09:47:57.745395 65137 naive_executor.cc:107] ---  skip [relu_2.tmp_0], fetch -> fetch
I0308 09:47:57.752557 65160 analysis_predictor.cc:548] ======= optimize end =======
I0308 09:47:57.752624 65160 naive_executor.cc:107] ---  skip [feed], feed -> x
I0308 09:47:57.754582 65160 naive_executor.cc:107] ---  skip [relu_2.tmp_0], fetch -> fetch


Figure 6: Server log


Sending the service request


!python pipeline_http_client.py 


The output is as follows. 


{'err_no': 0, 'err_msg': '', 'key': ['res'], 'value': ["['...', '...']"], 'tensors': []}
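
To make the request and the response format more concrete, the sketch below shows one way a client could encode an image, send it, and decode a response shaped like the output above. It is an illustration under assumptions, not a copy of pipeline_http_client.py: the URL (host, port and path), the "image" key and the helper names are assumed and should be checked against config.yml and web_service.py.

```python
import ast
import base64
import json
import urllib.request

# Assumed endpoint: the real host, port and path depend on config.yml and web_service.py.
DEFAULT_URL = "http://127.0.0.1:9998/ocr/prediction"


def build_payload(image_bytes):
    """Wrap a base64-encoded image in the key/value JSON layout assumed here."""
    image_b64 = base64.b64encode(image_bytes).decode("utf8")
    return {"key": ["image"], "value": [image_b64]}


def parse_response(response):
    """Return the recognized texts from a response shaped like the output above."""
    if response["err_no"] != 0:
        raise RuntimeError(response["err_msg"])
    # The 'res' entry is the string form of a Python list, so literal_eval can parse it.
    return ast.literal_eval(response["value"][response["key"].index("res")])


def send_request(image_path, url=DEFAULT_URL):
    """POST one image to the running service and return the recognized texts."""
    with open(image_path, "rb") as image_file:
        body = json.dumps(build_payload(image_file.read())).encode("utf8")
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as reply:
        return parse_response(json.loads(reply.read()))
```

Only send_request needs a running service; build_payload and parse_response can be tried offline.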


You can adjust the concurrency in config.yml. Since this section only demonstrates the basic behavior, both values are
left at the default of 1.

det:
    # Concurrency: thread concurrency when is_thread_op=True; otherwise process concurrency
    concurrency: 1

rec:
    # Concurrency: thread concurrency when is_thread_op=True; otherwise process concurrency
    concurrency: 1


The inference performance data is automatically written to the PipelineServingLogs/pipeline.tracer file.


7.4.5 FAQ 


Q1: After sending the request, no result is returned, or an output decoding error is reported.


A1: Do not set a proxy when starting the service and sending the request. Disable the proxy beforehand with the
following commands:


unset https_proxy 
unset http_proxy 
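
If the request is sent from a Python script rather than from a shell, the same effect can be achieved inside the script. The following is a minimal sketch; the helper name clear_proxy_env is hypothetical, and handling the upper-case variable names as well is an assumption added here, since some tools read those instead.

```python
import os

# Lower-case names match the unset commands above; the upper-case variants are an added assumption.
PROXY_VARS = ("http_proxy", "https_proxy", "HTTP_PROXY", "HTTPS_PROXY")


def clear_proxy_env(environ=os.environ):
    """Remove proxy variables from the given environment and return the names removed."""
    return [name for name in PROXY_VARS if environ.pop(name, None) is not None]
```

Calling clear_proxy_env() before sending the request affects only the current process and its children, not the shell that started it.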


7.5 End-side inference based on Paddle Lite


As the mobile Internet becomes more popular, the number of mobile phones and embedded devices keeps growing. At the
same time, for reasons of data security and the cost of running models, more and more models are run directly on
end-side devices.


Paddle Lite is the lightweight inference engine of PaddlePaddle. It provides efficient inference for mobile phones and
IoT terminals, and integrates a wide range of cross-platform hardware to offer lightweight solutions for end-side
deployment and application delivery.


This section introduces the steps to deploy the ultra-lightweight Chinese detection and recognition models of
PaddleOCR on mobile terminals based on Paddle Lite.


The following demonstrates the PP-OCRv2 series models running on Android.


Android demo link

Since the demo cannot be shown here, the rest of this section explains how to develop a program that runs the PP-OCRv2
system based on Paddle Lite.


For hands-on practice, refer to the PaddleOCR deployment document based on Paddle Lite.


7.5.1 Environment Preparation

You need to prepare the cross-compilation environment and the Paddle Lite inference library. The cross-compilation
environment is used to generate executable files that can run on end-side devices. Docker is recommended as the
cross-compilation environment.


7.5.2 Preparing the model

For inference with Paddle Lite, the inference model must first be converted into an optimized model for Paddle Lite
(the file suffix is usually .nb). During conversion, a variety of strategies are applied automatically to optimize the
original model, including quantization, subgraph fusion, hybrid scheduling, and kernel optimization. The optimized
model is lighter and faster.
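
To give an intuition for why quantization makes a model lighter, the sketch below applies symmetric per-tensor int8 quantization to a list of weights. It is a generic textbook illustration under stated assumptions, not the scheme that Paddle Lite actually applies; the function names are hypothetical.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization of floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    # One shared scale maps the largest magnitude to 127; all-zero input keeps scale 1.0.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    return [round(w / scale) for w in weights], scale


def dequantize(q_values, scale):
    """Recover approximate float weights from the integers and the scale."""
    return [q * scale for q in q_values]


if __name__ == "__main__":
    weights = [0.4, -1.27, 0.0, 1.0]
    q_values, scale = quantize_int8(weights)
    print(q_values, scale)
    print(dequantize(q_values, scale))
```

Each weight is stored in one byte instead of four, at the cost of a rounding error of at most half the scale per weight.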


7.5.3 Compiling 


Run make -j4 to compile and obtain the executable file. The first execution of this command downloads dependent
libraries such as OpenCV. After the download is complete, run make -j4 again.


7.5.4 Uploading to mobile terminals such as mobile phones 


Use tools such as adb to transfer the executable file, model files, and configuration files to the mobile device.


7.5.5 Running


Run the executable file on the mobile terminal to get the result. An example of the output is shown below.


Figure 7: Output results on the mobile terminal 
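
The upload and run steps can also be scripted from the development machine. The sketch below only builds adb push and adb shell command lines; the remote directory and any file or executable names passed in are assumptions for illustration and must be replaced with the ones produced by your own build.

```python
import subprocess

# Assumed working directory on the device; adjust to your setup.
REMOTE_DIR = "/data/local/tmp/debug"


def adb_push_commands(local_files, remote_dir=REMOTE_DIR):
    """Build one 'adb push' command per local file."""
    return [["adb", "push", path, f"{remote_dir}/"] for path in local_files]


def adb_run_command(executable, args, remote_dir=REMOTE_DIR):
    """Build an 'adb shell' command that runs the executable inside remote_dir."""
    shell_line = f"cd {remote_dir} && " + " ".join([f"./{executable}", *args])
    return ["adb", "shell", shell_line]


def deploy_and_run(local_files, executable, args, run=subprocess.run):
    """Push all files, then run the executable on the device."""
    for cmd in adb_push_commands(local_files):
        run(cmd, check=True)
    run(adb_run_command(executable, args), check=True)
```

A call such as deploy_and_run(["demo_binary", "det.nb", "rec.nb"], "demo_binary", ["det.nb", "rec.nb"]) uses placeholder names and requires a connected device with adb available.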


7.5.6 FAQ 


Q1: If I want to change the model, do I need to go through the whole process again?


A1: No. If you have already gone through the above steps, you only need to replace the .nb model file. Remember to
update the dictionary as well.


Q2: How do I test with another picture?
A2: Replace the .jpg test image under debug with the image you want to test, and push it to the phone with adb.
Q3: How do I package the demo into a mobile app?


A3: This demo aims to provide the core algorithm part for running OCR on mobile phones. You can refer to
PaddleOCR/deploy/android_demo, which is an example of packaging this demo into an application on the mobile phone.


7.6 Homework


Please refer to the Inference Deployment Objective Questions and Inference Deployment Practice Questions sections.


CHAPTER
EIGHT


DOCUMENT ANALYSIS TECHNOLOGY 


8.1 Introduction to Document Analysis Technology 


This chapter introduces the theory of document analysis technology, including its background, the categories of
algorithms, and the ideas behind them.


In this chapter, you will learn:
1. Categories and ideas of layout analysis
2. Categories and ideas of table recognition
3. Categories and ideas of information extraction


Documents are carriers of information, and different layouts suit different kinds of information, such as lists and ID
cards. Document analysis aims at the automatic reading, interpretation, and extraction of information. It usually
covers the following research directions:


1. Layout analysis module: It divides each document page into different content regions. This module can be used not
only to separate relevant regions from irrelevant ones, but also to classify the content it recognizes.


2. Optical Character Recognition (OCR) module: It locates and recognizes all text in the document.
3. Table recognition module: It recognizes tables in the document and converts them into Excel files.


4. Information extraction module: It uses OCR results and image information to understand and identify the specific
information expressed in the document, or the relationships between pieces of information.


Since the OCR module has been detailed in the previous chapters, the other three modules are introduced one by one
here. For each module, the classic or common methods and datasets are presented.


8.1.1 Layout Analysis 
Background Introduction 


Layout analysis is mainly used for document retrieval, key information extraction, content classification, and similar
tasks. It aims to classify the content regions of document images into categories such as plain text, titles, tables,
pictures, and lists. Layout analysis remains challenging because of the diversity and complexity of document layouts
and formats, the poor quality of document images, and the lack of large-scale annotated datasets. The layout analysis
task is visualized in the figure below:


Figure 1: Diagram of the layout analysis


The existing solutions are generally based on object detection or semantic segmentation, which detect or segment the
different patterns in a document as different objects.


Some representative papers are divided into the above two categories, as shown in the following table:


Category | Main papers
Method based on object detection | Visual Detection with Context, Object Detection, VSR
Method based on semantic segmentation | Semantic Segmentation


Method Based on Object Detection


Building on the object detection algorithm Faster R-CNN, Soto Carlos [1] uses contextual information and the inherent
location information of the document content to improve region detection performance. Li Kai [2] et al. also propose a
document analysis method based on object detection, which addresses the cross-domain problem by introducing a feature
pyramid alignment module, a region alignment module, and a rendering layer alignment module. These three modules
complement each other and align the domains from both a general image perspective and a document-specific
perspective, thus reducing the inconsistency between large labeled training datasets and the target domain. The
following figure is a flow chart of layout analysis based on the object detection algorithm Faster R-CNN.

Figure 2: Flow chart of layout analysis based on Faster R-CNN
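
A detection-based layout analyzer outputs candidate regions as boxes with a category and a score, and overlapping candidates usually have to be merged afterwards. The sketch below shows the two generic building blocks for that step, intersection over union (IoU) and greedy non-maximum suppression. It is a simplified, class-agnostic illustration and is not taken from the papers cited above; the region format (category, box, score) is an assumption.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def suppress_duplicates(regions, iou_threshold=0.5):
    """Greedy non-maximum suppression over (category, box, score) tuples."""
    kept = []
    # Visit regions from the highest score to the lowest.
    for region in sorted(regions, key=lambda r: r[2], reverse=True):
        # Keep a region only if it does not overlap strongly with one already kept.
        if all(iou(region[1], other[1]) < iou_threshold for other in kept):
            kept.append(region)
    return kept
```

For example, a "title" candidate that almost coincides with a higher-scoring "text" region is dropped, while a distant "table" region is kept.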


Methods Based on Semantic Segmentation


Sarkar Mausoom [3] et al. propose a prior-based segmentation mechanism to train a document segmentation model on
high-resolution images, which solves the problem that the structures of dense regions can no longer be distinguished,
and get merged, when the original image is shrunk too much. Zhang Peng [4] et al. propose VSR (Vision, Semantics and
Relations), a unified framework for document layout analysis. The framework uses a two-stream network to extract
visual and semantic features and fuses them adaptively through an adaptive aggregation module, addressing two
limitations of existing CV-based methods: inefficient fusion of different modalities and the lack of relationship
modeling between layout components.


Datasets
Although the existing methods can solve the layout analysis task to some extent, these methods rely on a large amount of 
labeled training data. Recently, many datasets have been proposed to be used in document analysis. 


1. PubLayNet [5]: The dataset contains 500,000 document images, of which 400,000 are used for training, 50,000 for
validation, and 50,000 for testing. Tables, texts, pictures, titles, and lists are labeled in the dataset.


2. HJDataset [6]: The dataset contains 2271 document images. Besides bounding boxes and masks of the content, it also
includes the hierarchical structure and reading order of layout elements.


Some samples of the PubLayNet dataset are shown in the figure below:


Figure 3: PubLayNet example


8.1.2 Table Recognition


Background Introduction 


Tables are common page elements in various types of documents. With the explosive growth in the number of documents,
it has become urgent to efficiently find tables in documents and obtain their contents and structures. This task is
called Table Recognition. Its difficulties can be summarized as follows:


1. The types and styles of tables are complex and diverse, for example merged rows and columns, various content types,
etc.


2. The styles of the documents themselves vary widely.
3. The lighting conditions during shooting are complex, etc.


The task of table recognition is to convert the table information in a document into an Excel file. The task is
visualized as follows:

Figure 4: Example image of table recognition. The left is the original image, and the right is the result after table recognition, presented in


Existing table recognition algorithms can be divided into the following four categories according to the principle of
table structure reconstruction:


1. Method based on heuristic rules
2. CNN-based method
3. GCN-based method
4. Method based on End to End

Some representative papers are divided into the above four categories, as shown in the following table:


Category | Idea | Main papers
Method based on heuristic rules | Manually designed rules; connected component detection, analysis and processing | T-Recs, pdf2table
CNN-based method | Object detection, semantic segmentation | CascadeTabNet, Multi-Type-TD-TSR, LGPMA, tabstruct-net, CDeC-Net, TableNet, TableSense, DeepDeSRT, DeepTabStR, GTE, Cycle-CenterNet, FCN
GCN-based method | Treats table recognition as a graph reconstruction problem on the basis of graph neural networks | GNN, TGRNet, GraphTSR
Method based on End to End | Uses attention mechanism | Table-Master


Traditional Algorithm Based on Heuristic Rules 


Early research on table recognition was mainly based on heuristic rules. For example, the T-Recs system proposed by
Kieninger [1] et al. analyzes the connected components of document images bottom-up and merges them according to
predefined rules to obtain logical text blocks. pdf2table, proposed by Yildiz [2] et al., is the first table
recognition method for PDF documents; it exploits information unique to PDF files (such as text, drawing paths, and
other information that is difficult to obtain from image documents) to assist table recognition. More recently,
Koci [3] et al. represent the layout regions of a page as a graph and then use the Remove and Conquer (RAC) algorithm
to identify tables as subgraphs.


Figure 5: Schematic of heuristic algorithm


Method Based on Deep Learning CNN 


With the rapid development of deep learning in computer vision, natural language processing, speech processing, and
other fields, researchers have applied deep learning to table recognition and achieved good results.


In the DeepTabStR algorithm, Siddiqui Shoaib Ahmed [12] et al. formulate table structure recognition as an object
detection problem and use deformable convolution to better detect table cells. Raja Sachin [6] et al. propose
TabStruct-Net, which combines cell detection and structure recognition visually to perform table structure
recognition, solving the recognition errors caused by large variations in table layout. However, this method cannot
handle rows and columns that contain many empty cells.


Figure 6: Schematic of algorithm based on deep learning CNN


(a) Row Predictions (b) Column Predictions (c) Cell Predictions


Figure 7: Example of algorithm errors based on deep learning CNN


Previous table structure recognition methods start from elements of different granularities (rows/columns and text
regions) and tend to overlook the problem of merging empty cells. Qiao Liang [10] et al. propose a new framework,
LGPMA, which makes full use of local and global features through a mask re-scoring strategy to obtain more reliable
aligned cell regions, and then introduces a table structure recovery pipeline, including cell matching, empty cell
searching, and empty cell merging, to complete table structure recognition.


In addition to the above algorithms, which only recognize table structure, some methods detect and recognize tables in
one model. Schreiber Sebastian [11] et al. propose DeepDeSRT, which uses Faster R-CNN for table detection and an FCN
semantic segmentation model to detect the structure, rows, and columns of tables; however, it uses two independent
models for the two tasks. Prasad Devashish [4] et al. propose an end-to-end deep learning method, CascadeTabNet, which
uses a Cascade Mask R-CNN HRNet model to detect tables and recognize their structures simultaneously, removing the
limitation of using two independent methods. Paliwal Shubham [8] et al. propose TableNet, a novel end-to-end deep
multi-task architecture for table detection and structure recognition; additional spatial semantic features are added
during training to further improve performance. Zheng Xinyi [13] et al. propose GTE, a systematic framework for table
recognition that uses a cell detection network to guide the training of the table detection network. They also put
forward a hierarchical network and a new cluster-based cell structure recognition algorithm. The framework can be
attached to any object detection model, which facilitates the training of different table recognition algorithms.

Previous research mainly focuses on table images with simple layouts and good alignment in scanned PDF documents.
However, tables in real scenes may be complex and seriously distorted, curved, or occluded. Therefore, Long Rujiao [14]
et al. construct WTW, a table recognition dataset for complex real-world scenarios, and propose Cycle-CenterNet, which
uses a cycle-pairing module and a new pairing loss to accurately group discrete cells into structured tables,
improving table recognition performance.


Figure 8: Schematic of end-to-end algorithm


CNN-based methods handle cells that span multiple rows or columns poorly. The following two families of methods were
developed to tackle this problem.


Method Based on Deep Learning GCN 


In recent years, with the rise of the Graph Convolutional Network (GCN), some researchers have tried to apply GCN to
table structure recognition. Qasim Shah Rukh [20] et al. treat table structure recognition as a problem well suited to
graph neural networks and design a novel differentiable architecture that combines the feature extraction strength of
convolutional neural networks with the effective interaction between vertices of a graph neural network. However, this
method uses only the location features of cells, not their semantic features. Chi Zewen [19] et al. propose another
graph neural network, GraphTSR, for table structure recognition in PDF files. It takes table cells as input and uses
the connections between the edges and nodes of the graph to predict the relationships between cells and thereby
identify the table structure, which helps with cells that span multiple rows or columns. Xue Wenyuan [21] et al.
reformulate table structure recognition as table reconstruction and propose an end-to-end method, TGRNet, which
consists of a cell detection branch and a cell logical location branch. The two branches predict the spatial and
logical positions of cells, addressing the fact that previous methods paid no attention to the logical position of
cells.


Diagram of the GraphTSR table recognition algorithm:

Figure 9: Diagram of GraphTSR. Given a table in PDF as input, the method recognizes its structure in the following
four steps: (a) Pre-processing: obtaining cell contents and their corresponding bounding boxes from the PDF;
(b) Graph construction: building an undirected graph on these cells; (c) Relation prediction: predicting adjacency
relations with GraphTSR; (d) Post-processing: recovering the table structure from the labeled graph.
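
The graph construction, relation prediction, and post-processing steps can be illustrated with a toy example. In the sketch below, a simple geometric rule stands in for the learned relation predictor, and every pair of cells is compared instead of building a sparser graph, so it is only an illustration of the data flow under these assumptions and not the GraphTSR model itself; all function names are hypothetical.

```python
from itertools import combinations


def overlap(a_start, a_end, b_start, b_end):
    """Length of the overlap between intervals [a_start, a_end] and [b_start, b_end]."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def predict_relation(box_a, box_b):
    """Rule-based stand-in for the learned relation predictor; boxes are (x1, y1, x2, y2)."""
    if overlap(box_a[1], box_a[3], box_b[1], box_b[3]) > 0:
        return "horizontal"  # the cells share vertical extent, so they sit in the same row
    if overlap(box_a[0], box_a[2], box_b[0], box_b[2]) > 0:
        return "vertical"  # the cells share horizontal extent, so they sit in the same column
    return None


def group_rows(boxes):
    """Label cell pairs, then merge cells linked by 'horizontal' edges into rows."""
    parent = list(range(len(boxes)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(boxes)), 2):
        if predict_relation(boxes[i], boxes[j]) == "horizontal":
            parent[find(i)] = find(j)

    rows = {}
    for i in range(len(boxes)):
        rows.setdefault(find(i), []).append(i)
    # Post-processing: order rows top to bottom and cells left to right.
    ordered = sorted(rows.values(), key=lambda row: min(boxes[i][1] for i in row))
    return [sorted(row, key=lambda i: boxes[i][0]) for row in ordered]
```

For a 2 x 2 grid of cells, group_rows returns two rows of two cell indices each; columns could be recovered in the same way from the "vertical" edges.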


Method Based on End-to-End 


Unlike other methods, which reconstruct the table structure through post-processing, end-to-end methods use the
network to output the HTML representation of the table structure directly.


Figure 10: Input and output of the end-to-end method. The input is a table image, and the output is an HTML sequence
such as <td rowspan="2">row 1, col 1</td>.


Figure 11: Image Caption example, with captions such as "A person riding a motorcycle on a dirt road." and "Two dogs
play in the grass."


These methods mainly use Seq2Seq models from Image Caption to predict the table structure, for example
attention-based or Transformer-based models.


Figure 12: Schematic of Seq2Seq

In TableMaster, Ye Jiaquan [22] obtains a table structure model by improving the Transformer-based Master algorithm.
In addition, a branch is added for the coordinate regression of the boxes. Instead of splitting the model into two
branches at the last layer, the author decouples sequence prediction and box regression after the first Transformer
decoding layer. The comparison between its network structure and the original Master network is shown in the figure
below:

Figure 13: Left: Master network (vanilla MASTER), right: TableMaster network (table structure MASTER)
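
Once an end-to-end model has predicted the structure sequence, the recognized cell texts still have to be placed into it. The sketch below assumes that the structure is available as a list of complete HTML tags and that the cell texts are listed in the same order as the opening td tags; real models may tokenize tags differently, and the function name is hypothetical.

```python
from html import escape


def build_table_html(structure_tokens, cell_texts):
    """Insert recognized cell texts into a predicted structure token sequence."""
    texts = iter(cell_texts)
    parts = ["<table>"]
    for token in structure_tokens:
        parts.append(token)
        # Each opening <td ...> tag consumes the next recognized cell text.
        if token.startswith("<td"):
            parts.append(escape(next(texts, "")))
    parts.append("</table>")
    return "".join(parts)


if __name__ == "__main__":
    tokens = [
        "<tr>", '<td rowspan="2">', "</td>", "<td>", "</td>", "</tr>",
        "<tr>", "<td>", "</td>", "</tr>",
    ]
    print(build_table_html(tokens, ["row 1-2, col 1", "row 1, col 2", "row 2, col 2"]))
```

Cell texts are escaped so that characters such as < in the recognized text cannot break the generated HTML.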


Datasets
Since deep learning methods are data-driven, a large amount of labeled data is required to train the models. The small
size of earlier datasets was an important constraint, so several larger datasets have been proposed.

1. PubTabNet [16]: It contains 568k table images and their structured HTML representations.


2. PubMed Tables (PubTables-1M) [17]: A table structure recognition dataset with highly detailed structural
annotations. 460,589 PDF images are used for table detection tasks, and 947,642 table images are used for table
recognition tasks.


3. TableBank [18]: A table detection and recognition dataset built from Word and LaTeX documents on the Internet,
containing 417K high-quality annotations.


4. SciTSR [19]: A table structure recognition dataset in which most of the images are converted from papers. It
contains 15,000 tables from PDF files and the corresponding structure labels.


5. TabStructDB [12]: It includes 1081 table regions annotated with dense row and column information.


6. WTW [14]: A large-scale dataset for table detection and recognition in natural scenes, containing 14,581 images in
total, whose tables are distorted, curved, or occluded.


Dataset examples

Figure 14: Sample of PubTables-1M dataset, illustrating the three subtasks of table extraction addressed by the
dataset

Figure 15: Sample of WTW dataset


8.1.3 Document VQA


The boss assigns a task: develop an ID card recognition system.

How to choose a plan:
1. Use rules to extract information after text detection
2. Use a model to extract information after text detection


3. Resort to outsourcing
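
As an illustration of the first plan, rules applied after text detection and recognition can be as simple as a regular expression over the recognized lines. The sketch below assumes an 18-character ID number made of 17 digits followed by a digit or X; both the pattern and the function name are assumptions for illustration, and a real system would need additional rules for every other field.

```python
import re

# Assumed format: 17 digits followed by a digit or X, not embedded in a longer digit run.
ID_PATTERN = re.compile(r"(?<!\d)\d{17}[\dXx](?!\d)")


def extract_id_number(ocr_lines):
    """Return the first ID-like number found in the recognized text lines, or None."""
    for line in ocr_lines:
        # OCR output often contains stray spaces inside long numbers, so remove them first.
        match = ID_PATTERN.search(line.replace(" ", ""))
        if match:
            return match.group(0).upper()
    return None
```

Such rules are quick to write but brittle: a new card layout or a recognition error in the number breaks them, which is one motivation for the model-based plan.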


Background Introduction 


In the VQA (Visual Question Answering) task, questions and answers mainly focus on the content of the image. For text
images, however, the core information is the text, so two text-centered types of VQA can be distinguished: Text-VQA
for natural scenes and DocVQA for scanned documents. The relationship between the three is shown in the figure below.


Figure 16: The hierarchy of VQA


Sample pictures of VQA, Text-VQA and DocVQA are shown in the table below.

Task type | VQA | Text-VQA | DocVQA
Task description | Ask questions regarding picture content | Ask questions regarding texts on pictures | Ask questions regarding texts of document images
Sample picture | Question: What can the red object on the ground be used for? Answer: Firefighting. Support fact: Fire hydrant can be used for fighting fires. | Q: What is the name of the store? A: tellus mater inc. | Q: What is the Extension Number as per the voucher? GT: (910) 741-0673


Because DocVQA is closer to actual application scenarios, it is more widely used in academic research and industrial 
scenarios. Usually, the questions asked in DocVQA are fixed. For example, the questions in the ID card scenario are 
generally: 
1. What is the ID number? 
2. What is your name? 
3. What ethnic group are you from? 


Figure 17: Example of an ID card 


Because the questions are known in advance, research on DocVQA has begun to focus on the Key Information Extraction (KIE) task. 
Here we mainly discuss KIE-related research. KIE extracts the required key information from the image, 
such as the name and the ID number on an ID card. 


KIE is usually divided into two sub-tasks for research: 
1. SER: Semantic Entity Recognition classifies each detected text, for example marking a text as the name 
on the ID card. The example is shown in the picture on the left. 


2. RE: Relation Extraction first classifies each detected text, for example into questions and answers, and then 
finds the corresponding answer to each question. As shown in the figure below, the red and black boxes represent 
the question and the answer, respectively, and the yellow line represents the relation between the question and the 
answer. 
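

The two sub-tasks can be combined as in the following minimal sketch. It assumes that an SER model has already labeled each text segment and that an RE model has already produced (question index, answer index) links; the names TextSegment and pair_entities, as well as the sample data, are hypothetical and only serve the illustration.

```python
from dataclasses import dataclass


@dataclass
class TextSegment:
    text: str   # recognized text of one detected box
    label: str  # SER label: "question", "answer" or "other"


def pair_entities(segments, relations):
    """Turn SER labels plus RE links into question -> answer pairs.

    `relations` is a list of (question_index, answer_index) tuples,
    i.e. the kind of output an RE model is expected to produce.
    """
    pairs = {}
    for q_idx, a_idx in relations:
        q, a = segments[q_idx], segments[a_idx]
        # Keep only links that are consistent with the SER labels.
        if q.label == "question" and a.label == "answer":
            pairs[q.text] = a.text
    return pairs


segments = [
    TextSegment("Name", "question"),
    TextSegment("Martin", "answer"),
    TextSegment("ID number", "question"),
    TextSegment("1234", "answer"),
    TextSegment("Page 1", "other"),
]
# The last link is inconsistent with the SER labels and is dropped.
relations = [(0, 1), (2, 3), (4, 1)]
print(pair_entities(segments, relations))
```

The result is a dictionary of key information, which is the form most downstream applications of KIE need.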
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The general KIE method is researched based on Named Entity Recognition (NER) [4], but this kind of methods only 
use the text information in the image and lack visual and structural information, so they are not very accurate. Recent 
methods therefore integrate visual and structural information with the text. According to the principle used to fuse the 
multimodal information, these methods can be divided into the following four types: 


1. Grid-based method 

2. Token-based method 

3. GCN-based method 

4. End-to-end method 


Some representative papers of the above four categories are listed below: 


Category | Ideas | Main Papers 

Grid-based method | Fusing multi-modal information on images (texts, layouts, images) | Chargrid 

Token-based method | Using methods such as BERT for multi-modal information fusion | LayoutLM, LayoutLMv2, StrucTexT 

GCN-based method | Using the graph network structure for multi-modal information fusion | GCN, PICK, SDMG-R, SERA 

End-to-end method | Unifying OCR and key information extraction into one network | Trie 
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Grid-Based Method 


The Grid-based method performs multimodal information fusion on the image. Chargrid[5] first detects and recognizes 
the characters in the image, constructs the network input by filling the one-hot code of each character into the corresponding character region 
(the non-black part in the right image below), and passes this input through a CNN with an encoder-decoder 
structure to perform coordinate detection and classification of key information. 
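

The construction of the network input described above can be sketched as follows. This is a simplified illustration in plain Python, not the Chargrid implementation: the function name build_chargrid, the tiny vocabulary and the character boxes are assumptions made for the example.

```python
def build_chargrid(height, width, chars, vocab):
    """Build an H x W x len(vocab) one-hot grid from recognized characters.

    `chars` is a list of (character, x0, y0, x1, y1) tuples in pixel units.
    Index 0 of `vocab` is reserved for the background.
    """
    index = {c: i for i, c in enumerate(vocab)}
    depth = len(vocab)
    # Start with every pixel marked as background.
    grid = [[[1 if k == 0 else 0 for k in range(depth)]
             for _ in range(width)] for _ in range(height)]
    for ch, x0, y0, x1, y1 in chars:
        k = index.get(ch, 0)  # unknown characters fall back to background
        for y in range(max(y0, 0), min(y1, height)):
            for x in range(max(x0, 0), min(x1, width)):
                grid[y][x] = [1 if j == k else 0 for j in range(depth)]
    return grid


vocab = ["<bg>", "A", "B"]
grid = build_chargrid(4, 6, [("A", 0, 0, 2, 2), ("B", 3, 1, 5, 3)], vocab)
print(grid[0][0], grid[1][3], grid[3][5])
```

Every pixel inside a character box carries the one-hot code of that character, so the layout of the page is preserved in the input tensor.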


Figure 19: Example of Chargrid data (raw document page on the left, corresponding chargrid representation on the right) 


Figure 20: Chargrid network (one encoder followed by two decoders, for semantic segmentation and bounding box regression) 


Compared with traditional methods based only on text, this method can use both text information and structural 
information, so its accuracy can be higher. But it does not fuse the two deeply. 
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Token-Based Method 


LayoutLM[6] encodes 2D position information and text information together into the BERT model, and borrows the 
pre-training idea of BERT in NLP to pre-train on large-scale datasets. In downstream tasks, LayoutLM also introduces image 
information to improve the performance of the model. Although LayoutLM combines text, location and image information, 
the image information is fused only in the training of downstream tasks, so the multi-modal fusion is not 
sufficient. Based on LayoutLM, LayoutLMv2 [7] integrates image information with text and layout information in the 
pre-training stage through transformers, and also adds a spatial-aware self-attention mechanism to the Transformer 
to assist the model in fusing visual and text features. Although LayoutLMv2 fuses text, location and image 
information in the pre-training stage, the visual features learned by the model are not fine enough due to the limitation 
of the pre-training tasks. So StrucTexT [8] proposes two new pre-training tasks, Sentence Length Prediction (SLP) and Paired Boxes 
Direction (PBD), to help the network learn fine visual features. The SLP task makes the model learn the 
length of the text segment, and the PBD task makes the model learn the matching relationship between box directions. 
In this way, the deep cross-modal fusion between text, visual and layout information can be promoted. 
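

The idea of encoding 2D position and text together can be illustrated with a small sketch in the spirit of LayoutLM. It is not the LayoutLM implementation: the random tables stand in for learned embeddings, and the names make_table, normalize_box and embed_token, the embedding size and the 0 to 1000 coordinate range used here are assumptions of the example.

```python
import random


def make_table(rows, dim, seed):
    """Create a small random embedding table (a stand-in for learned weights)."""
    rng = random.Random(seed)
    return [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(rows)]


def normalize_box(box, page_w, page_h, scale=1000):
    """Map pixel coordinates to integers in [0, scale]."""
    x0, y0, x1, y1 = box
    return (int(scale * x0 / page_w), int(scale * y0 / page_h),
            int(scale * x1 / page_w), int(scale * y1 / page_h))


def embed_token(token_id, box, tables):
    """Sum the text embedding and the four 2D position embeddings."""
    x0, y0, x1, y1 = box
    rows = [tables["word"][token_id],
            tables["x"][x0], tables["y"][y0],
            tables["x"][x1], tables["y"][y1]]
    return [sum(vals) for vals in zip(*rows)]


tables = {
    "word": make_table(100, 8, seed=0),
    "x": make_table(1001, 8, seed=1),
    "y": make_table(1001, 8, seed=2),
}
box = normalize_box((50, 20, 150, 40), page_w=500, page_h=200)
vector = embed_token(token_id=42, box=box, tables=tables)
print(box, len(vector))
```

Because the position embeddings are simply added to the text embedding, the Transformer layers that follow see both what a token says and where it sits on the page.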


Figure 21: The flow chart of transformer algorithm 


Figure 22: The flow chart of LayoutLMv2 algorithm 


GCN-Based Method 


Although the existing GCN-based methods [10] make use of text and structure information, they do not make good use 
of image information. PICK [11] adds image information to the GCN network and proposes a graph learning module to 
automatically learn edge types. SDMG-R [12] encodes the image as a dual-modality graph. The nodes of the graph are the visual 
and textual information of the text regions, and the edges represent the direct spatial relationship between adjacent texts. By 
iteratively propagating information along the edges and inferring graph node categories, SDMG-R solves the problem that 
existing methods fail to handle novel templates. 
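

The idea of propagating information along the edges of a graph of text regions can be sketched generically as follows. This is neither PICK nor SDMG-R: the k-nearest-neighbour edges, the mean aggregation and the names build_edges and propagate are simplifying assumptions chosen only to show one round of message passing.

```python
import math


def center(box):
    x0, y0, x1, y1 = box
    return ((x0 + x1) / 2, (y0 + y1) / 2)


def build_edges(boxes, k=2):
    """Connect every text box to its k nearest neighbours (by centre distance)."""
    centers = [center(b) for b in boxes]
    edges = {}
    for i, ci in enumerate(centers):
        dists = sorted((math.dist(ci, cj), j)
                       for j, cj in enumerate(centers) if j != i)
        edges[i] = [j for _, j in dists[:k]]
    return edges


def propagate(features, edges):
    """One round of mean aggregation: each node averages itself and its neighbours."""
    updated = []
    for i, feat in enumerate(features):
        group = [feat] + [features[j] for j in edges[i]]
        updated.append([sum(v) / len(group) for v in zip(*group)])
    return updated


boxes = [(0, 0, 10, 10), (20, 0, 30, 10), (100, 0, 110, 10)]
features = [[1.0], [3.0], [5.0]]
edges = build_edges(boxes, k=1)
print(edges)
print(propagate(features, edges))
```

In a real model the node features would combine visual and textual embeddings, and the aggregation would use learned weights instead of a plain mean.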


The PICK flow chart is shown in the figure below: 
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Figure 23: The flow chart of PICK algorithm 


SERA[10] introduces the biaffine parser in dependency syntax analysis to the document relation extraction, and GCN is 
used to fuse text and visual information. 


Figure 24: The flow chart of SERA algorithm 


End-to-End Method 


Existing methods divide KIE into two independent tasks: text reading and information extraction. However, they focus only 
on improving information extraction and ignore that text reading and information extraction are interrelated. 
Therefore, Trie [9] proposes a unified end-to-end network where the two tasks can be learned at the same time and 
reinforce each other in the process. 


Figure 25: The flow chart of Trie algorithm (the network predicts text regions, text content and entities of interest in a single forward pass) 


Datasets 


The datasets used for KIE mainly include: 


1. SROIE: Task 3 of SROIE [2] aims to extract four pieces of predefined information from a scanned receipt: company, date, 
address and total amount. There are 626 samples in the dataset for training and 347 samples for testing. 


2. FUNSD: The FUNSD dataset [3] is used to extract form information from scanned documents. It contains 199 annotated 
forms from real scenarios, of which 149 are used for training and 50 for testing. The FUNSD dataset 
assigns a semantic tag to each word: question, answer, header or other. 


3. XFUN: The XFUN dataset is a multilingual dataset proposed by Microsoft. It contains 7 languages, and each 
language has 149 training samples and 50 test samples. 
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Figure 26: Example of SROIE Figure 27: Example of XFUN 
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8.2 OCR Table Recognition Practice 


This section introduces how to use PaddleOCR to train and run the table recognition algorithm, including: 
1. Understanding the principle of the table recognition algorithm 
2. Mastering the training and prediction of PaddleOCR table recognition code 


8.2.1 Quick Start 


To quickly demonstrate the PP-Structure prediction, first download the PaddleOCR code and install dependency packages. 


# clone PaddleOCR code 
! git clone -b release/2.4 https://gitee.com/paddlepaddle/PaddleOCR 


# Install dependencies 
! pip install -U https://paddleocr.bj.bcebos.com/whl/layoutparser-0.0.0-py3-none-any.whl 
! pip install -r PaddleOCR/requirements.txt 
! pip install pandas 


After the installation, quickly complete the table recognition through the following command: 


# Switch to working directory 
import os 
os.chdir('PaddleOCR/ppstructure') 


# Download the model 
! mkdir inference && cd inference 
# Download the detection model of the ultra-lightweight OCR model and unzip it 
! wget -P ./inference/ https://paddleocr.bj.bcebos.com/PP-OCRv2/chinese/ch_PP-OCRv2_det_infer.tar && cd inference && tar xf ch_PP-OCRv2_det_infer.tar && cd .. 
# Download the recognition model of the ultra-lightweight OCR model and unzip it 
! wget -P ./inference/ https://paddleocr.bj.bcebos.com/PP-OCRv2/chinese/ch_PP-OCRv2_rec_infer.tar && cd inference && tar xf ch_PP-OCRv2_rec_infer.tar && cd .. 
# Download the ultra-lightweight English table structure model and unzip it 
! wget -P ./inference/ https://paddleocr.bj.bcebos.com/dygraph_v2.0/table/en_ppocr_mobile_v2.0_table_structure_infer.tar && cd inference && tar xf en_ppocr_mobile_v2.0_table_structure_infer.tar && cd .. 


# Read the image and display it 
import cv2 
from matplotlib import pyplot as plt 
%matplotlib inline 

img = cv2.imread('../doc/table/table.jpg') 
plt.imshow(img) 


<matplotlib.image.AxesImage at 0x7f87db2672d0> 


# https://github.com/PaddlePaddle/PaddleOCR/blob/dygraph/ppstructure/table/predict_table.py#L55 

from table.predict_table import TableSystem, to_excel 
from utility import init_args 

# Initialize parameters 
args = init_args().parse_args(args=[]) 
args.det_model_dir = 'inference/ch_PP-OCRv2_det_infer' 
args.rec_model_dir = 'inference/ch_PP-OCRv2_rec_infer' 
args.table_model_dir = 'inference/en_ppocr_mobile_v2.0_table_structure_infer' 
args.rec_char_dict_path = '../ppocr/utils/ppocr_keys_v1.txt' 
args.table_char_dict_path = '../ppocr/utils/dict/table_structure_dict.txt' 
args.det_limit_side_len = 736 
args.det_limit_type = 'min' 
args.output = '../output/table' 
args.use_gpu = False 

# Initialize the table recognition system 
table_sys = TableSystem(args) 
img = cv2.imread('../doc/table/table.jpg') 
# Perform table recognition 
pred_html = table_sys(img) 
# Store the result to an excel file 
to_excel(pred_html, '1.xlsx') 
# Display the HTML result 
from IPython.core.display import display, HTML 
display(HTML(pred_html)) 


[2022/03/17 21:27:08] root DEBUG: dt_boxes num : 78, elapse : 0.5478317737579346 
[2022/03/17 21:27:10] root DEBUG: rec_res num : 78, elapse : 1.2926230430603027 


<IPython.core.display.HTML object> 


# Read the excel and display it 
import pandas as pd 
df = pd.read_excel('1.xlsx').fillna('') 
print(df) 
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8.2.2 Explanation of Prediction Principle: 
Introduction to the Overall Pipeline 


PP-Structure's table recognition algorithm is an end-to-end algorithm. 
It consists of three models: 
1. Text detection model: It is used to detect the single-line text in the table. 
2. Text recognition model: It is used to recognize the detected text. 
3. Table structure and cell prediction model: It is used to predict the HTML information of the table structure and the cell 
coordinates. 


The pipeline of the table recognition algorithm is shown in the figure below. 
The process is: 
1. Use the text detection model to detect the single-line text in the table; 
2. Use the text recognition model to recognize the detected text. After this step, we have the text boxes and their text; 
3. Use the table structure and cell prediction model to predict the HTML information of the table structure and the cell 
coordinates; 
4. Aggregate the text boxes from step 1 and the cell coordinates from step 3, as shown in the figure below. Whether a text box 
is aggregated into a cell is determined by the IOU between the red text detection boxes and the blue cell coordinate detection boxes; 
5. After the aggregation, sort the text boxes in each cell from top to bottom and from left to right. The index of each 
sorted text box gives the corresponding text, and the texts are concatenated to get the 
text content of the cell. 
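

Steps 4 and 5 can be sketched as follows. This is a minimal illustration under simplifying assumptions, not the PP-Structure code: each text box is assigned to the single cell with the highest IOU, and the function names iou and fill_cells and the sample boxes are hypothetical. The exact matching rule used by PP-Structure may differ.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x0, y0, x1, y1)."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(ix1 - ix0, 0) * max(iy1 - iy0, 0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def fill_cells(cell_boxes, text_boxes, texts):
    """Assign each text line to the cell with the highest IOU, then join per cell."""
    assigned = {i: [] for i in range(len(cell_boxes))}
    for box, text in zip(text_boxes, texts):
        scores = [iou(box, cell) for cell in cell_boxes]
        best = max(range(len(cell_boxes)), key=scores.__getitem__)
        if scores[best] > 0:  # ignore text that overlaps no cell at all
            assigned[best].append((box, text))
    contents = []
    for i in range(len(cell_boxes)):
        # Sort from top to bottom, then from left to right.
        lines = sorted(assigned[i], key=lambda item: (item[0][1], item[0][0]))
        contents.append(" ".join(text for _, text in lines))
    return contents


cells = [(0, 0, 100, 40), (100, 0, 200, 40)]
text_boxes = [(5, 22, 90, 38), (5, 2, 90, 18), (110, 10, 190, 30)]
texts = ["rate", "Recall", "73.2"]
print(fill_cells(cells, text_boxes, texts))
```

The two text lines that fall into the first cell are reordered by their vertical position before being joined, which is why the cell reads "Recall rate" although "rate" was detected first.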


Introduction to Table Structure Inference Model 


Table recognition requires three models: text detection, text recognition and table structure recognition models. The text 
detection and recognition models have been introduced, and here the table structure inference model will be elaborated. 


The table structure inference model predicts the table structure and detects table cell coordinates. The structure model is 
modified from the RARE algorithm, and improvements have been made mainly in the following aspects: 


Input Data 


For the text recognition model, each character marked in the dataset is independent, but in the table structure inference 
model, the category to be predicted is not a single character. The following is a dictionary comparison between RARE 
and the table structure inference model: 


model | dictionary 

RARE | single characters, for example 'v', 'p', 's', 'b', 'y', ... 

Table structure model | 'sos', '<thead>', '</thead>', '<tbody>', '</tbody>', '<td', ' colspan="2"', ' colspan="3"', ' colspan="4"', ' colspan="5"', ' colspan="6"', ' colspan="7"', ' colspan="8"', ' colspan="9"', ' colspan="10"', ' rowspan="2"', ' rowspan="3"', ' rowspan="4"', ' rowspan="5"', ' rowspan="6"', ' rowspan="7"', ' rowspan="8"', ' rowspan="9"', ' rowspan="10"', ... 


The table structure inference model treats a string like <thead> as a single character to be recognized. 
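

The effect of such a dictionary can be illustrated with a small tokenizer. This is a sketch under assumptions: the regular expression, the function names tokenize_structure and encode, and the tiny vocabulary built on the fly are hypothetical and only mimic the kind of entries shown in the comparison above.

```python
import re

# Vocabulary entries are whole tags or tag fragments, not single characters.
TOKEN_PATTERN = re.compile(r'</?[a-z]+>|<td| (?:col|row)span="\d+"|>')


def tokenize_structure(html):
    """Split an HTML structure string into dictionary-style tokens."""
    return TOKEN_PATTERN.findall(html)


def encode(tokens, vocab):
    """Map each token to its index in the dictionary."""
    return [vocab[t] for t in tokens]


html = '<thead><tr><td colspan="2"></td></tr></thead>'
tokens = tokenize_structure(html)
# Build a toy vocabulary from the tokens of this one example.
vocab = {tok: i for i, tok in enumerate(['sos'] + sorted(set(tokens)) + ['eos'])}
print(tokens)
print(encode(tokens, vocab))
```

A cell with a span attribute is split into '<td', the attribute and '>', so a small, fixed dictionary can describe arbitrary combinations of spans.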
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Model 


The comparison chart of the table structure recognition model and RARE is as follows: 
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Figure 3: Table structure recognition model 


The RARE model is composed of TPS+CNN+RNN+AttentionHead, and the functions of each part are as follows: 
1. TPS: It corrects the curved texts in the images. 
2. CNN: It extracts features from the images. 
3. RNN: It further enhances the extracted features and extracts semantic features. 
4. AttentionHead: It performs the output. 


In the table structure recognition model, the input is a complete image, so the TPS module is removed. In addition, 
experiments show that the RNN has little effect on the result, so the RNN module is removed as well. 
The structure of the table structure recognition model is therefore CNN+AttentionHead. 


In order to output the cell coordinates, we tried to detect the cells with the detection model. On the basis 
of the DB model, schemes 2 and 3 were put forward in addition to scheme 1, single-line text detection. 


It can be seen that detecting texts and cells in one segmentation model makes the ground truth (GT) ambiguous: should 
the background between lines of text inside a cell be labeled as text or as background? 


Among the three models of the entire table recognition pipeline, only the text detection and table structure recognition 
models can obtain the information of the entire image. Therefore, an additional regression-based branch is added to the 
AttentionHead of the table structure recognition model to complete the cell coordinate (x0, y0, x1, y1) detection. 


Forward Analysis of Table Structure Inference Model 


Forward analysis traces how the output shape changes in each module, from image preprocessing 
to network output, to better understand table cell inference and table structure inference. The modules involved 
are as follows: 


Type | Module Name 
Data Processing | ResizeTableImage 
Data Processing | PaddingTableImage 
Backbone | MobileNetV3 
Head | TableAttentionHead 


Input Data Processing 


In this example, the input image and the output of the data processing module are visualized as follows: 


# Switch to the PaddleOCR directory 
os.chdir('..') 
from ppocr.data import create_operators, transform 

plt.figure(figsize=(24, 8)) 
# Read the input image 
img = cv2.imread('doc/table/table.jpg') 
# Display the input image 
plt.subplot(1, 3, 1) 
plt.title('src, shape: {}'.format(img.shape)) 
plt.imshow(img) 


# Implement ResizeTableImage 
# https://github.com/PaddlePaddle/PaddleOCR/blob/dygraph/ppocr/data/imaug/gen_table_mask.py#L182 
# Scale the long side of the picture to the specified length, and scale the short side in equal proportions 
pre_process_list = [{'ResizeTableImage': {'max_len': args.table_max_len}}] 
preprocess_op = create_operators(pre_process_list) 
data = {'image': img} 
data = transform(data, preprocess_op) 

# Display the image after ResizeTableImage 
plt.subplot(1, 3, 2) 
plt.title('ResizeTableImage, shape: {}'.format(data['image'].shape)) 
plt.imshow(data['image']) 

# Implement PaddingTableImage 
# https://github.com/PaddlePaddle/PaddleOCR/blob/dygraph/ppocr/data/imaug/gen_table_mask.py#L232 
pre_process_list = [{'PaddingTableImage': None}] 
preprocess_op = create_operators(pre_process_list) 
data = transform(data, preprocess_op) 

# Display the image after PaddingTableImage 
plt.subplot(1, 3, 3) 
plt.title('PaddingTableImage, shape: {}'.format(data['image'].shape)) 
plt.imshow(data['image'] / 255) 
plt.show() 


# Define a list of processing ops 
pre_process_list = [ 
    {'ResizeTableImage': {'max_len': args.table_max_len}}, 
    {'NormalizeImage': {'scale': 1. / 255., 'mean': [0.485, 0.456, 0.406], 'std': [0.229, 0.224, 0.225], 'order': 'hwc'}}, 
    {'PaddingTableImage': None}, 
    {'ToCHWImage': None} 
] 
# Create op list 
preprocess_op = create_operators(pre_process_list) 
# Execute op list 
data = {'image': img} 
data = transform(data, preprocess_op) 


[Figure: the source image (src), the image after ResizeTableImage, and the image after PaddingTableImage, shape: (488, 488, 3)] 


# Download the pre-trained model 
! wget -P ./pretrained_models/ https://paddleocr.bj.bcebos.com/dygraph_v2.1/table/en_ppocr_mobile_v2.0_table_structure_train.tar && cd pretrained_models && tar xf en_ppocr_mobile_v2.0_table_structure_train.tar && cd .. 

# Load the pre-trained model 
import paddle 

# Read pre-training parameters and divide them into backbone parameters and head parameters 
pretrain_params = paddle.load('pretrained_models/en_ppocr_mobile_v2.0_table_structure_train/best_accuracy.pdparams') 


def filter_params(pretrain_params, prefix): 
    # Keep only the parameters whose name starts with the prefix, and strip the prefix 
    new_dict = {} 
    for k, v in pretrain_params.items(): 
        if k.startswith(prefix): 
            new_dict[k.replace(prefix + '.', '')] = v 
    return new_dict 


# Extract parameters 
backbone_dict = filter_params(pretrain_params, 'backbone') 
head_dict = filter_params(pretrain_params, 'head') 


Backbone 


The backbone is the same as the backbone of the text detection model. Both output four feature maps with sizes of 1/4, 1/8, 1/16 and 
1/32 of the input image. The relevant backbones have been introduced in the text detection chapter and will not be repeated 
here. 
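

The spatial sizes of the four feature maps can be estimated with a short calculation. The sketch below assumes that every stride-2 stage rounds the size up, which is consistent with the 16*16, 20*20 and 25*25 sizes quoted in the comments of the head code in this section; the function name feature_map_sizes is hypothetical.

```python
import math


def feature_map_sizes(input_size, strides=(4, 8, 16, 32)):
    """Spatial size of each backbone output.

    Assumption: each stride-2 stage is padded so that the size is rounded up.
    """
    sizes = []
    for stride in strides:
        size = input_size
        for _ in range(int(math.log2(stride))):
            size = math.ceil(size / 2)
        sizes.append(size)
    return sizes


for side in (488, 640, 800):
    sizes = feature_map_sizes(side)
    # The head flattens the last (smallest) map to H*W positions.
    print(side, sizes, sizes[-1] ** 2)
```

For a 488 x 488 input the smallest map has 16 x 16 = 256 positions, which is the number of positions the attention head attends over.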


# https://github.com/PaddlePaddle/PaddleOCR/blob/dygraph/ppocr/modeling/backbones/det_mobilenet_v3.py 

from ppocr.modeling.backbones import build_backbone 

# Initialize the backbone 
backbone = build_backbone(dict(name='MobileNetV3', scale=1.0, model_name='large'), model_type='table') 
backbone.eval() 
# Load backbone parameters 
backbone.set_state_dict(backbone_dict) 

import numpy as np 
x = np.expand_dims(data['image'], axis=0) 
x = paddle.to_tensor(x) 
backbone_out = backbone(x) 
for item in backbone_out: 
    print(item.shape) 


[1, 24, 122, 122] 
[1, 40, 61, 61] 
[1, 112, 31, 31] 
[1, 960, 16, 16] 


Head 
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The input of the head is the four feature maps output by the backbone, and the output is the inference result of the table 
structure and cell coordinates 


The meanings of the input parameters are: 


Parameter | Meaning 
in_channels | The number of channels of the input feature map 
hidden_size | The number of hidden units of the RNN module in Attention 
max_elem_length | Maximum number of inferred characters 
in_max_len | The size of the input image 
loc_type | Input of the cell coordinate output branch. 1: only the hidden layer after Attention is used; 2: fuse CNN + Attention features 


The code is as follows: 


# https://github.com/PaddlePaddle/PaddleOCR/blob/dygraph/ppocr/modeling/heads/table_att_head.py 

from paddle import nn 
import paddle.nn.functional as F 
from ppocr.modeling.heads.table_att_head import AttentionGRUCell 


class TableAttentionHead(nn.Layer): 
    def __init__(self, 
                 in_channels, 
                 hidden_size, 
                 loc_type=2, 
                 in_max_len=488,  # The size of the input image 
                 max_elem_length=800,  # Maximum number of output labels 
                 **kwargs): 
        super(TableAttentionHead, self).__init__() 
        self.input_size = in_channels[-1] 
        self.hidden_size = hidden_size 
        self.elem_num = 30 
        self.max_elem_length = max_elem_length 

        self.structure_attention_cell = AttentionGRUCell( 
            self.input_size, hidden_size, self.elem_num, use_gru=False) 
        self.structure_generator = nn.Linear(hidden_size, self.elem_num) 

        self.loc_type = loc_type 
        self.in_max_len = in_max_len 

        # Coordinate box regression branch 
        if self.loc_type == 1: 
            self.loc_generator = nn.Linear(hidden_size, 4) 
        else: 
            if self.in_max_len == 640: 
                # 640: the last feature map output by the backbone is 20*20, 
                # so the size of the input feature map here is 400 
                self.loc_fea_trans = nn.Linear(400, self.max_elem_length + 1) 
            elif self.in_max_len == 800: 
                # 800: the last feature map output by the backbone is 25*25, 
                # so the size of the input feature map here is 625 
                self.loc_fea_trans = nn.Linear(625, self.max_elem_length + 1) 
            elif self.in_max_len == 488: 
                # 488: the last feature map output by the backbone is 16*16, 
                # so the size of the input feature map here is 256 
                self.loc_fea_trans = nn.Linear(256, self.max_elem_length + 1) 
            self.loc_generator = nn.Linear(self.input_size + hidden_size, 4) 

    def _char_to_onehot(self, input_char, onehot_dim): 
        input_ont_hot = F.one_hot(input_char, onehot_dim) 
        return input_ont_hot 


    def forward(self, inputs, targets=None): 
        # Take out the smallest map output by the backbone 
        fea = inputs[-1] 
        if len(fea.shape) == 3: 
            pass 
        else: 
            # Reshape B,C,H,W as B,C,H*W 
            last_shape = int(np.prod(fea.shape[2:])) 
            fea = paddle.reshape(fea, [fea.shape[0], fea.shape[1], last_shape]) 
            # Change B,C,W into B,W,C 
            fea = fea.transpose([0, 2, 1]) 
        batch_size = fea.shape[0] 

        hidden = paddle.zeros((batch_size, self.hidden_size)) 
        output_hiddens = [] 
        if self.training and targets is not None: 
            structure = targets[0] 
            for i in range(self.max_elem_length + 1): 
                elem_onehots = self._char_to_onehot( 
                    structure[:, i], onehot_dim=self.elem_num) 
                (outputs, hidden), alpha = self.structure_attention_cell( 
                    hidden, fea, elem_onehots) 
                output_hiddens.append(paddle.unsqueeze(outputs, axis=1)) 
            output = paddle.concat(output_hiddens, axis=1) 
            structure_probs = self.structure_generator(output) 
            if self.loc_type == 1: 
                loc_preds = self.loc_generator(output) 
                loc_preds = F.sigmoid(loc_preds) 
            else: 
                loc_fea = fea.transpose([0, 2, 1]) 
                loc_fea = self.loc_fea_trans(loc_fea) 
                loc_fea = loc_fea.transpose([0, 2, 1]) 
                loc_concat = paddle.concat([output, loc_fea], axis=2) 
                loc_preds = self.loc_generator(loc_concat) 
                loc_preds = F.sigmoid(loc_preds) 
        else: 
            temp_elem = paddle.zeros(shape=[batch_size], dtype="int32") 
            structure_probs = None 
            loc_preds = None 
            elem_onehots = None 
            outputs = None 
            alpha = None 
            max_elem_length = paddle.to_tensor(self.max_elem_length) 
            i = 0 
            # Attention forward 
            while i < max_elem_length + 1: 
                elem_onehots = self._char_to_onehot( 
                    temp_elem, onehot_dim=self.elem_num) 
                (outputs, hidden), alpha = self.structure_attention_cell( 
                    hidden, fea, elem_onehots) 
                output_hiddens.append(paddle.unsqueeze(outputs, axis=1)) 
                structure_probs_step = self.structure_generator(outputs) 
                temp_elem = structure_probs_step.argmax(axis=1, dtype="int32") 
                i += 1 

            output = paddle.concat(output_hiddens, axis=1) 
            print('Attention output shape', output.shape) 
            # Table structure branch 
            structure_probs = self.structure_generator(output) 
            structure_probs = F.softmax(structure_probs) 
            # Cell coordinate branch 
            if self.loc_type == 1: 
                loc_preds = self.loc_generator(output) 
                loc_preds = F.sigmoid(loc_preds) 
            else: 
                # Change B,W,C into B,C,W 
                loc_fea = fea.transpose([0, 2, 1]) 
                loc_fea = self.loc_fea_trans(loc_fea) 
                loc_fea = loc_fea.transpose([0, 2, 1]) 
                loc_concat = paddle.concat([output, loc_fea], axis=2) 
                loc_preds = self.loc_generator(loc_concat) 
                loc_preds = F.sigmoid(loc_preds) 
        return {'structure_probs': structure_probs, 'loc_preds': loc_preds} 


# Initialize the head 
head = TableAttentionHead(in_channels=backbone.out_channels, hidden_size=256, loc_type=2) 
head.eval() 
# Load the head parameter 
head.set_state_dict(head_dict) 

# Execute the head 
print('*' * 10, 'head forward shape', '*' * 10) 
head_out = head(backbone_out) 
print('*' * 10, 'head out shape', '*' * 10) 
# Print the head output and the corresponding shape 
for key in head_out: 
    print(key, head_out[key].shape) 


********** head forward shape ********** 
Attention output shape [1, 801, 256] 
********** head out shape ********** 
structure_probs [1, 801, 30] 
loc_preds [1, 801, 4] 
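

The shapes printed above can be traced by hand for loc_type=2. The following sketch only does shape arithmetic and runs no network; the helper names linear, transpose_last_two and trace_loc_branch are hypothetical, and the default values (960 channels, a 16 x 16 feature map, hidden size 256) are taken from the example in this section.

```python
def linear(shape, in_features, out_features):
    """Shape after a linear layer applied to the last axis."""
    assert shape[-1] == in_features, (shape, in_features)
    return shape[:-1] + (out_features,)


def transpose_last_two(shape):
    return shape[:-2] + (shape[-1], shape[-2])


def trace_loc_branch(batch=1, channels=960, fmap=16, hidden=256,
                     max_elem_length=800, elem_num=30):
    steps = max_elem_length + 1
    fea = (batch, fmap * fmap, channels)            # backbone output as B, H*W, C
    output = (batch, steps, hidden)                 # attention outputs, one per step
    loc_fea = transpose_last_two(fea)               # B, C, H*W
    loc_fea = linear(loc_fea, fmap * fmap, steps)   # B, C, steps
    loc_fea = transpose_last_two(loc_fea)           # B, steps, C
    loc_concat = (batch, steps, hidden + channels)  # concat on the last axis
    loc_preds = linear(loc_concat, channels + hidden, 4)
    structure_probs = linear(output, hidden, elem_num)
    return {"loc_fea": loc_fea, "loc_concat": loc_concat,
            "structure_probs": structure_probs, "loc_preds": loc_preds}


shapes = trace_loc_branch()
for name, shape in shapes.items():
    print(name, shape)
```

The linear layer loc_fea_trans maps the 256 spatial positions to 801 decoding steps, so the CNN features can be concatenated step by step with the attention outputs before the four coordinates are regressed.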


Post-processing 


The dictionary file for post-processing is ppocr/utils/dict/table_structure_dict.txt. 


The idea of post-processing decoding: 


1. Perform CTC decoding on structure_probs: remove the background characters sos and eos, and take only one of the 
consecutively repeated characters. 

2. The output coordinates are normalized to 0-1, so multiply them by the width and height of the image to decode 
them to the image space. 
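

Part of this decoding can be sketched in a few lines. The sketch covers only the removal of the sos and eos markers and the scaling of the coordinates; it is not the TableLabelDecode implementation, and the function names decode_structure and denormalize_box and the toy vocabulary are assumptions of the example.

```python
def decode_structure(indices, vocab, sos="sos", eos="eos"):
    """Map predicted indices to tokens, dropping sos and stopping at eos."""
    tokens = []
    for idx in indices:
        token = vocab[idx]
        if token == eos:
            break
        if token == sos:
            continue
        tokens.append(token)
    return tokens


def denormalize_box(box, img_w, img_h):
    """Scale a box from the 0-1 range to pixels and clip it to the image."""
    x0, y0, x1, y1 = box
    left = max(int(img_w * x0), 0)
    top = max(int(img_h * y0), 0)
    right = min(int(img_w * x1), img_w - 1)
    bottom = min(int(img_h * y1), img_h - 1)
    return [left, top, right, bottom]


vocab = ['sos', '<tr>', '<td>', '</td>', '</tr>', 'eos']
print(decode_structure([0, 1, 2, 3, 4, 5, 2], vocab))
print(denormalize_box((0.25, 0.5, 0.75, 1.5), img_w=200, img_h=100))
```

Clipping matters because a predicted coordinate slightly outside the 0-1 range would otherwise produce a box that leaves the image.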


# https://github.com/PaddlePaddle/PaddleOCR/blob/dygraph/ppocr/postprocess/rec_postprocess.py#L441 

from ppocr.postprocess.rec_postprocess import TableLabelDecode 


def post_process(out): 
    character_dict_path = 'ppocr/utils/dict/table_structure_dict.txt' 
    # Initialize the post-processing op 
    post_op = TableLabelDecode(character_dict_path) 
    post_result = post_op(out) 
    structure_str_list = post_result['structure_str_list'] 

    # Restore the normalized coordinates to the size of the original image 
    res_loc = post_result['res_loc'] 
    imgh, imgw = img.shape[0:2] 
    res_loc_final = [] 
    for rno in range(len(res_loc[0])): 
        x0, y0, x1, y1 = res_loc[0][rno] 
        left = max(int(imgw * x0), 0) 
        top = max(int(imgh * y0), 0) 
        right = min(int(imgw * x1), imgw - 1) 
        bottom = min(int(imgh * y1), imgh - 1) 
        res_loc_final.append([left, top, right, bottom]) 

    # Process the structure information 
    structure_str_list = structure_str_list[0][:-1] 
    structure_str_list = ['<html>', '<body>', '<table>'] + structure_str_list + ['</table>', '</body>', '</html>'] 
    return structure_str_list, res_loc_final 


structure_str_list, res_loc_final = post_process(head_out) 
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# print the first 10 outputs 
print(structure_str_list[0:10]) 
print(res_loc_final[0:10]) 


# Visualize the inference box 
plt.figure(figsize=(24, 8)) 
img_show = img.copy() 
for box in res_loc_final: 
    cv2.rectangle(img_show, (box[0], box[1]), (box[2], box[3]), (255, 0, 0), 2) 
plt.imshow(img_show) 


['<html>', '<body>', '<table>', '<thead>', '<tr>', '<td>', '</td>', '<td>', '</td>', '<td>'] 
[[...], [...], ...] 


<matplotlib.image.AxesImage at 0x7f87e84392d0> 
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8.2.3 Training 


In the training of table recognition, three models need to be trained, including text detection, text recognition, and table 
structure models. For the training of text detection and recognition models, you can refer to the previous sections. Only 
the training of the table structure model is introduced here. 


This section uses the PubTabNet dataset and a table structure model with a MobileNetV3 backbone to introduce
how to train, evaluate, and test the table structure model.


Data Preparation

This experiment uses the PubTabNet dataset for demonstration. A sample from the PubTabNet dataset is shown
in the following figure:


Part of the PubTabNet dataset has already been downloaded in the project and is stored in
/home/aistudio/data/data119702. You can run the following command to unzip the dataset, or download it from
https://github.com/ibm-aur-nlp/PubTabNet


# Unzip the dataset 
! cd /home/aistudio/data/data119702 && tar -xf pubtabnet_val.tar && cd - 
! ls /home/aistudio/data/data119702 


/home/aistudio/PaddleOCR
PubTabNet_2.0.0_val.jsonl pubtabnet_val.tar val 


After running the above command, /home/aistudio/data/data119702 contains one folder and one label file:


/home/aistudio/data/data119702
├── val/                          Folder for image storage
└── PubTabNet_2.0.0_val.jsonl     Label information


The label format of this dataset is:


{
    'filename': 'PMC5755158_010_01.png',    # Image name
    'split': 'train',                       # Whether the image belongs to the training set or the validation set
    'imgid': 0,                             # The index of the image
    'html': {
        'structure': {'tokens': ['<thead>', '<tr>', '<td>', ...]},    # HTML string of the table
        'cells': [
            {
                'tokens': ['P', 'a', 'd', 'd', 'l', 'e', 'P', 'a', 'd', 'd', 'l', 'e'],    # Single text in the table
                'bbox': [x0, y0, x1, y1]    # The coordinates of the single text in the table
            }
        ]
    }
}


Data Preprocessing


Training places requirements on the format and size of the input images. Therefore, before the data are fed
into the model, they need to be preprocessed so that the images and labels meet the needs of network training and
inference.


The data preprocessing of the table structure model mainly includes:

• DecodeImage, which decodes the image into a NumPy array;
• ResizeTableImage, which resizes the image so that its long side matches a specified size and scales the short side
  by the same ratio;
• TableLabelEncode, which parses the label information in the label file and saves it in a unified format;
• NormalizeImage, which normalizes the pixel values of the input image (subtracting the mean and dividing by the
  standard deviation) so that the input distribution is close to zero mean and unit variance, which makes
  optimization smoother and training easier to converge;
• PaddingTableImage, which pads the short side of the image to the same length as the long side;
• ToCHWImage, which rearranges the image from the [H, W, C] layout (height, width, channel) to the [C, H, W] layout
  used by the neural network, for example changing [224, 224, 3] into [3, 224, 224];
• KeepKeys, which keeps only the required keys of the data dict.
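
The geometry of the resize, padding, and layout steps listed above can be sketched in a few lines. This is a minimal illustration rather than the PaddleOCR implementation: the helper name resize_then_pad_shape and the target size of 488 are assumptions made for the example.

```python
def resize_then_pad_shape(height, width, max_len=488):
    """Return the resized (h, w) and the (bottom, right) padding.

    max_len is a hypothetical target size for the long side.
    """
    long_side = max(height, width)
    # Scale both sides by the same ratio so the long side becomes max_len.
    new_h = height * max_len // long_side
    new_w = width * max_len // long_side
    # Pad the short side so that the result is a max_len x max_len square.
    return (new_h, new_w), (max_len - new_h, max_len - new_w)


shape, padding = resize_then_pad_shape(200, 800)
print(shape, padding)

# ToCHWImage then only reorders the axes: [H, W, C] -> [C, H, W].
hwc_shape = (488, 488, 3)
chw_shape = (hwc_shape[2], hwc_shape[0], hwc_shape[1])
print(chw_shape)
```

Working with shapes only keeps the sketch free of image libraries; in practice the same ratio would also have to be applied to the cell boxes if they were not already normalized.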
TableLabelEncode 


To analyze the label information in the label file, first load the label data and take out a label 


# Load a piece of data in the dataset
import json
from pprint import pprint

with open('/home/aistudio/data/data119702/PubTabNet_2.0.0_val.jsonl', "rb") as f:
    data_lines = f.readlines()
    for line in data_lines:
        data_line = line.decode('utf-8').strip("\n")
        info = json.loads(data_line)
        break


Run the following code to observe the comparison before and after the TableLabelEncode label encoding. 


import os

import cv2
from ppocr.data.imaug import TableLabelEncode

# Initialize the label encoder
label_encoder_op = TableLabelEncode(max_text_length=100,  # Unused
                                    max_elem_length=50,   # Maximum number of cells predicted for each image
                                    max_cell_num=500,     # Unused
                                    character_dict_path='ppocr/utils/dict/table_structure_dict.txt')
# Construct the input data
cells = info['html']['cells']
structure = info['html']['structure']
# Print the label before encoding
print("The cells and structure before decode")
print("cells: ", cells)
print("structure:", structure)
image = cv2.imread(os.path.join('/home/aistudio/data/data119702/val', info['filename']))
data = {'image': image, 'cells': cells, 'structure': structure}
# Execute the label encoder
data = label_encoder_op(data)
# Print the encoded information
print("The bbox_list and structure after decode")
print("bbox_list: ", data['bbox_list'].tolist())
print("structure:", data['structure'].tolist())


This produces the following output (abridged):


The cells and structure before decode
cells: [
    {'tokens': []},
    {'tokens': ['<b>', ..., '</b>'], 'bbox': [...]},
    {'tokens': ['<b>', ..., '</b>'], 'bbox': [...]},
    ...
]
structure: {'tokens': ['<thead>', '<tr>', '<td>', ..., '</td>', '</tr>', '</tbody>']}
The bbox_list and structure after decode
bbox_list: [
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.27731093764305115, 0.06779661029577255, 0.40336135029792786, 0.22033898532390594],
    ...
]
structure: [0, 1, 2, ..., 0, 0, 0, 0, 0, 0]
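
The kind of encoding seen in this output, token indices padded to a fixed length and cell boxes scaled to 0-1, can be sketched as follows. The toy vocabulary, the helper name encode_table_label, and the use of 0 as the padding value are assumptions for illustration; TableLabelEncode itself builds its vocabulary from the dictionary file.

```python
def encode_table_label(structure_tokens, cell_bboxes, img_w, img_h, vocab, max_len):
    """Encode structure tokens as indices and normalize cell boxes to [0, 1]."""
    ids = [vocab["sos"]] + [vocab[t] for t in structure_tokens] + [vocab["eos"]]
    if len(ids) > max_len:
        raise ValueError("structure is longer than max_len")
    ids = ids + [0] * (max_len - len(ids))  # pad to a fixed length
    boxes = [[x0 / img_w, y0 / img_h, x1 / img_w, y1 / img_h]
             for x0, y0, x1, y1 in cell_bboxes]
    return ids, boxes


# Hypothetical toy vocabulary; the real one comes from table_structure_dict.txt.
vocab = {"sos": 0, "<thead>": 1, "<tr>": 2, "<td>": 3, "</td>": 4,
         "</tr>": 5, "</thead>": 6, "eos": 7}
tokens = ["<thead>", "<tr>", "<td>", "</td>", "</tr>", "</thead>"]
ids, boxes = encode_table_label(tokens, [[50, 10, 150, 30]],
                                img_w=200, img_h=100, vocab=vocab, max_len=10)
print(ids)
print(boxes)
```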


Loss Function 


The loss of the model is divided into two parts:
1. Structure loss: the structure loss uses CrossEntropyLoss
2. Loc loss: the loc loss uses MSELoss


The two losses are combined by weighting; structure_weight and loc_weight are 100 and 10000 respectively:


total_loss = structure_loss * structure_weight + loc_loss * loc_weight 
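
As a small worked example of the weighted sum above, the following sketch plugs in made-up loss values. The function name total_table_loss and the input numbers are illustrative assumptions; only the two weights come from the text.

```python
def total_table_loss(structure_loss, loc_loss,
                     structure_weight=100.0, loc_weight=10000.0):
    """Weighted sum of the structure (cross-entropy) and location (MSE) losses."""
    return structure_loss * structure_weight + loc_loss * loc_weight


# Made-up raw losses: an MSE on coordinates in [0, 1] is typically much
# smaller than a cross-entropy, so it receives the larger weight.
weighted = total_table_loss(structure_loss=2.5, loc_loss=0.01)
print(weighted)
```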
Model Training


Once the data are processed and the loss function is defined, model training can start.


Training is based on the PaddleOCR training engine and is driven by a configuration file. Refer to the configuration
file of the table recognition model,
https://github.com/PaddlePaddle/PaddleOCR/blob/dygraph/configs/table/table_mv3.yml.
The network structure parameters are as follows:


Architecture:
  model_type: table
  algorithm: TableAttn
  Backbone:
    name: MobileNetV3
    scale: 1.0
    model_name: large
  Head:
    name: TableAttentionHead
    hidden_size: 256
    loc_type: 2
    max_text_length: 100
    max_elem_length: 800
    max_cell_num: 500


The loss function parameters are as follows: 


Loss:
  name: TableAttentionLoss
  structure_weight: 100.0
  loc_weight: 10000.0


After the configuration is complete, training can be started by running the following commands:


# Configure the dataset
# !mkdir -p train_data/table/pubtabnet
! cd train_data/table/pubtabnet && ln -s /home/aistudio/data/data119702/PubTabNet_2.0.0_val.jsonl PubTabNet_2.0.0_train.jsonl \
    && ln -s /home/aistudio/data/data119702/PubTabNet_2.0.0_val.jsonl PubTabNet_2.0.0_val.jsonl \
    && ln -s /home/aistudio/data/data119702/val train \
    && ln -s /home/aistudio/data/data119702/val val


ln: failed to create symbolic link 'PubTabNet_2.0.0_train.jsonl': File exists


! python tools/train.py -c configs/table/table_mv3.yml -o Global.use_gpu=False Global.print_batch_step=1 \
    Train.loader.batch_size_per_card=1 Eval.loader.batch_size_per_card=1
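
The -o arguments in this command override individual entries of the YAML configuration by their dotted path. The following sketch shows the general idea on a plain nested dict; it is an illustration only, not PaddleOCR's own merging code, and the helper names and starting values are hypothetical.

```python
def parse_value(raw):
    """Convert a command-line string to bool, int, or float when possible."""
    if raw in ("True", "False"):
        return raw == "True"
    for cast in (int, float):
        try:
            return cast(raw)
        except ValueError:
            continue
    return raw


def apply_overrides(config, overrides):
    """Apply 'Section.key=value' overrides to a nested dict in place."""
    for item in overrides:
        dotted_key, raw_value = item.split("=", 1)
        *parents, leaf = dotted_key.split(".")
        node = config
        for key in parents:
            node = node.setdefault(key, {})
        node[leaf] = parse_value(raw_value)
    return config


# Hypothetical starting values; the real ones live in table_mv3.yml.
config = {
    "Global": {"use_gpu": True, "print_batch_step": 5},
    "Train": {"loader": {"batch_size_per_card": 32}},
    "Eval": {"loader": {"batch_size_per_card": 16}},
}
apply_overrides(config, [
    "Global.use_gpu=False",
    "Global.print_batch_step=1",
    "Train.loader.batch_size_per_card=1",
    "Eval.loader.batch_size_per_card=1",
])
print(config)
```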


During the training process, the following log will be output:


[2021/12/26 19:57:29] root INFO: train with paddle 2.2.1 and device CPUPlace
[2021/12/26 19:57:29] root INFO: Initialize indexs of datasets:train_data/table/pubtabnet/PubTabNet_2.0.0_train.jsonl
[2021/12/26 19:57:29] root INFO: Initialize indexs of datasets:train_data/table/pubtabnet/PubTabNet_2.0.0_val.jsonl
[2021/12/26 19:57:29] root INFO: train from scratch
[2021/12/26 19:57:29] root INFO: train dataloader has 9115 iters
[2021/12/26 19:57:29] root INFO: valid dataloader has 9115 iters
[2021/12/26 19:57:29] root INFO: During the training process, after the 0th iteration, an evaluation is run every 400 iterations
[2021/12/26 19:57:29] root INFO: Initialize indexs of datasets:train_data/table/pubtabnet/PubTabNet_2.0.0_train.jsonl
[2021/12/26 19:57:47] root INFO: epoch: [1/400], iter: 1, lr: 0.001000, loss: 358.711182, structure_loss: 277.904785, loc_loss: 80.806374, acc: 0.000000, reader_cost: 0.05254 s, batch_cost: 17.39120 s, samples: 2, ips: 0.11500
[2021/12/26 19:57:55] root INFO: epoch: [1/400], iter: 2, lr: 0.001000, loss: 353.381165, structure_loss: 208.200623, loc_loss: 137.825607, acc: 0.000000, reader_cost: 0.00041 s, batch_cost: 8.65134 s, samples: 1, ips: 0.11559


Model Evaluation 


During the training process, two models are saved by default: the most recently trained model, named latest, and the
model with the highest accuracy, named best_accuracy. Next, use the saved model parameters to evaluate the accuracy
on the test set.


The accuracy evaluation code of the table structure model is located in PaddleOCR/ppocr/metrics/table_metric.py. Call
tools/eval.py to evaluate the accuracy of the trained model.
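
One simple way to score a structure model is exact-match accuracy: a table counts as correct only if its whole predicted token sequence equals the label. The sketch below assumes that definition for illustration; the function name structure_accuracy is hypothetical, and the exact behaviour of table_metric.py should be checked in the file itself.

```python
def structure_accuracy(pred_seqs, label_seqs):
    """Fraction of tables whose predicted token ids match the label exactly."""
    if not label_seqs:
        return 0.0
    correct = sum(
        1 for pred, label in zip(pred_seqs, label_seqs)
        if list(pred) == list(label)
    )
    return correct / len(label_seqs)


# Toy token-id sequences: the second prediction differs in one position.
preds = [[1, 2, 3], [1, 2, 4], [5]]
labels = [[1, 2, 3], [1, 2, 3], [5]]
print(structure_accuracy(preds, labels))
```

Under this definition a single wrong token makes the whole table count as wrong, which is why the metric is strict for long tables.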


!python tools/eval.py -c configs/table/table_mv3.yml \
    -o Global.checkpoints=/home/aistudio/PaddleOCR/pre_train/en_ppocr_mobile_v2.0_table_structure_train/best_accuracy \
    Global.use_gpu=False Eval.loader.batch_size_per_card=1


The following log is output during evaluation. Because the dataset is large, the process is terminated manually here.


[2021/12/26 20:00:08] root INFO: train with paddle 2.2.1 and device CPUPlace
[2021/12/26 20:00:08] root INFO: Initialize indexs of datasets:train_data/table/pubtabnet/PubTabNet_2.0.0_val.jsonl
[2021/12/26 20:00:08] root INFO: resume from /home/aistudio/PaddleOCR/pre_train/en_ppocr_mobile_v2.0_table_structure_train/best_accuracy
[2021/12/26 20:00:08] root INFO: metric in ckpt ***************
[2021/12/26 20:00:08] root INFO: acc:0.7380142622051563
[2021/12/26 20:00:08] root INFO: fps:8.360272547972942
[2021/12/26 20:00:08] root INFO: best_epoch:7
[2021/12/26 20:00:08] root INFO: start_epoch:8
eval model::   0%|          | 2/9115 [00:07<8:55:26,  3.53s/it]^C


Model Inference 


After training the model, you can also use the saved model to run inference on a single image or on the images in
a folder, and observe the inference result.


! python tools/infer_table.py -c configs/table/table_mv3.yml \
    -o Global.checkpoints=pretrained_models/en_ppocr_mobile_v2.0_table_structure_train/best_accuracy \
    Global.infer_img=doc/table/table.jpg Global.use_gpu=False


The structure information and cell box information of the table are output during the inference process.


[2021/12/26 20:00:22] root INFO: train with paddle 2.2.1 and device CPUPlace
[2021/12/26 20:00:22] root INFO: resume from /home/aistudio/PaddleOCR/pre_train/en_ppocr_mobile_v2.0_table_structure_train/best_accuracy
[2021/12/26 20:00:22] root INFO: infer_img: /home/aistudio/1.jpg
[2021/12/26 20:00:26] root INFO: result: ['<thead><tr><td></td><td></td><td></td><td></td><td></td></tr></thead><tbody><tr><td></td><td></td><td></td><td></td><td></td></tr>...</tbody>'], [[32, 9, 104, 40], [232, 8, 307, 41], [429, 7, 500, 44], [559, 8, 656, 44], [715, 7, 780, 44], [37, 45, 99, 73], [190, 44, 342, 74], [432, 45, 502, 74], [565, 44, 655, 73], [712, 46, 777, 74], [30, 81, 101, 109], ..., [712, 477, 778, 504]]
[2021/12/26 20:00:26] root INFO: success!


8.2.4 Summary 


This section introduces the principle of PP-Structure table recognition algorithm in PaddleOCR, and the process of the 
table structure model from data processing to the end of training. 


8.2.5 Assignment 


https://aistudio.baidu.com/aistudio/education/objective/28711


8.3 DOC-VQA SER Practice 


This section will introduce how to use PaddleOCR to train and run the DOC-VQA SER algorithm, including: 
1. Understanding the principle of DOC-VQA SER algorithm 
2. Mastering the training process of DOC-VQA SER code in PaddleOCR 


8.3.1 Quick Start 


Prepare the code and the environment 


# clone PaddleOCR code 
! git clone -b release/2.4 https://gitee.com/paddlepaddle/PaddleOCR 


# Install dependencies 
! pip install -r PaddleOCR/requirements.txt 
! pip install paddleocr 


# Install dependencies 
! pip install yacs gnureadline paddlenlp==2.2.1 


# Change to the vqa directory
import os
os.chdir('PaddleOCR/ppstructure/vqa')

# Download the model
! mkdir pretrained_models
# Download the SER model and unzip it
! wget -P ./pretrained_models/ https://paddleocr.bj.bcebos.com/pplayout/PP-Layout_v1.0_ser_pretrained.tar && cd pretrained_models && tar xf PP-Layout_v1.0_ser_pretrained.tar && cd


# Perform SER inference
# https://github.com/PaddlePaddle/PaddleOCR/blob/release%2F2.4/ppstructure/vqa/infer_ser_e2e.py
! python infer_ser_e2e.py \
    --model_name_or_path "./pretrained_models/PP-Layout_v1.0_ser_pretrained/" \
    --max_seq_length 512 \
    --output_dir "output/res_e2e/" \
    --infer_imgs "images/input/zh_val_42.jpg"


import cv2
from matplotlib import pyplot as plt
%matplotlib inline

img = cv2.imread('output/res_e2e/zh_val_42_ser.jpg')
plt.figure(figsize=(48, 24))
plt.imshow(img)


process: [0/1], save result to output/res_e2e/zh_val_42_ser.jpg 
Corrupt JPEG data: premature end of data segment 


<matplotlib.image.AxesImage at 0x7fa228317cd0>


8.3.2 Explanation of The Principle

The algorithms of the DOC-VQA series in PaddleOCR are currently implemented based on the LayoutXLM paper and provide
two tasks: SER and RE.

LayoutXLM is a multilingual version of LayoutLMv2. The schematic of LayoutLMv2 is as follows:


Compared with BERT in NLP, LayoutXLM adds the layout information of the text in the image, together with image
features, to the input of the model. LayoutXLM has been implemented in PaddleNLP, so here we introduce the data and
the network from the perspective of the model's forward pass.


Input Data Processing

First, perform OCR or PDF parsing on the image to obtain the texts and their bbox information, and build the three
inputs of the model on this basis:


1. Text Embedding


First, use WordPiece to segment the text recognized by OCR, then add the [CLS] and [SEP] tags, and use [PAD] to
pad the sequence to length L, which gives the input sequence of the text:


$$S = \{[\mathrm{CLS}], w_1, w_2, \dots, [\mathrm{SEP}], [\mathrm{PAD}], [\mathrm{PAD}], \dots\}, \quad |S| = L$$


Then add the word vector, the one-dimensional position vector, and the segment vector to get the text vector. The
formula is as follows:


$$t_i = \mathrm{TokEmb}(w_i) + \mathrm{PosEmb1D}(i) + \mathrm{SegEmb}(s_i), \quad 0 \le i < L$$

One-dimensional position vector: the index of the word

Segment vector: [A]

# Text Embedding demo
from paddlenlp.transformers import LayoutXLMTokenizer

tokenizer = LayoutXLMTokenizer.from_pretrained('pretrained_models/PP-Layout_v1.0_ser_pretrained')
text = '...'  # a short Chinese sample sentence
# Tokenization
print(text, tokenizer.tokenize(text))
# The result of encoding
print(text, tokenizer.encode(text))


... [...]
... {'input_ids': [0, ..., 84072, ..., 2], 'token_type_ids': [0, 0, 0, 0, 0]}


2. Image Embedding


We use the ResNeXt-FPN network as the image encoder. First, extract the feature map of the original document
image, average-pool it to a fixed size (B * 256 * 7 * 7), flatten the feature map by rows (B * 256 * 49), and apply
a linear projection to obtain the feature sequence corresponding to the image (B * 49 * 256). As with the text
vector, the image vector is also supplemented with one-dimensional relative position and segment information.
Finally, add the feature vector, the one-dimensional position vector, and the segment vector to get the image vector:


$$v_i = \mathrm{Proj}(\mathrm{VisTokEmb}(I)_i) + \mathrm{PosEmb1D}(i) + \mathrm{SegEmb}([\mathrm{C}]), \quad 0 \le i < WH$$


Segment vector: [C]


3. Layout Embedding


For the coordinate range of each word or image region on the page, an axis-aligned bounding box is used to represent
the layout information. Each bounding box is represented by its 4 boundary coordinates together with its width and
height. The layout vector is obtained by concatenating the vectors corresponding to these 6 features:


$$l_i = \mathrm{Concat}(\mathrm{PosEmb2D}_x(x_0, x_1, w), \mathrm{PosEmb2D}_y(y_0, y_1, h)), \quad 0 \le i < WH + L$$
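
The six layout features that feed the two 2D position embeddings can be sketched as follows. The helper name layout_features and the sample box and page size are assumptions for illustration; the scaling to 0-1000 follows the normalization described in this section.

```python
def layout_features(bbox, page_width, page_height, scale=1000):
    """Return ((x0, x1, w), (y0, y1, h)) with coordinates scaled to 0..scale."""
    x0, y0, x1, y1 = bbox
    nx0 = int(x0 * scale / page_width)
    nx1 = int(x1 * scale / page_width)
    ny0 = int(y0 * scale / page_height)
    ny1 = int(y1 * scale / page_height)
    # The x-features index PosEmb2D_x, the y-features index PosEmb2D_y.
    return (nx0, nx1, nx1 - nx0), (ny0, ny1, ny1 - ny0)


# Illustrative pixel box on a 2480 x 3508 page.
x_feats, y_feats = layout_features([261, 802, 483, 859], 2480, 3508)
print(x_feats, y_feats)
```

Each of the six integers would then be looked up in an embedding table, and the six resulting vectors concatenated to form the layout vector.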


The following demonstrates the process of constructing the network input from an input image. The whole process
mainly includes the following steps:


1. Performing OCR recognition on the image
2. Preprocessing the image, including scaling to a particular size and normalization
3. Tokenizing the recognized text and converting it into vocabulary indices
4. Normalizing the text boxes so that their values lie between 0 and 1000
5. Padding the results of steps 3 and 4 to facilitate batching

# Construct the input of inference
# https://github.com/PaddlePaddle/PaddleOCR/blob/release/2.4/ppstructure/vqa/vqa_utils.py

import cv2
import numpy as np
import paddle
from copy import deepcopy
from paddleocr import PaddleOCR
from paddlenlp.transformers import LayoutXLMTokenizer

from infer_ser_e2e import trans_poly_to_bbox, pad_sentences, split_page


def parse_ocr_info_for_ser(ocr_result):
    # Convert the OCR result into a list of dictionaries, and convert each
    # text box to a bounding rectangle
    ocr_info = []
    for res in ocr_result:
        ocr_info.append({
            "text": res[1][0],
            "bbox": trans_poly_to_bbox(res[0]),
            "poly": res[0],
        })
    return ocr_info


def preprocess(
        tokenizer,
        ori_img,
        ocr_info,
        img_size=(224, 224),
        pad_token_label_id=-100,
        max_seq_len=512,
        add_special_ids=False,
        return_attention_mask=True, ):
    ocr_info = deepcopy(ocr_info)
    height = ori_img.shape[0]
    width = ori_img.shape[1]

    # Resize the image to a particular shape
    img = cv2.resize(ori_img, img_size).transpose([2, 0, 1]).astype(np.float32)

    segment_offset_id = []    # Store the ending position of each text in input_ids
    bbox_list = []            # Store the boxes normalized to 0-1000
    input_ids_list = []       # Store the index of each token in the vocabulary
    token_type_ids_list = []  # Store the category information of the text segment

    for info in ocr_info:
        # Normalize the box to 0-1000
        # x1, y1, x2, y2
        bbox = info["bbox"]
        bbox[0] = int(bbox[0] * 1000.0 / width)
        bbox[2] = int(bbox[2] * 1000.0 / width)
        bbox[1] = int(bbox[1] * 1000.0 / height)
        bbox[3] = int(bbox[3] * 1000.0 / height)

        # Tokenize the text and convert the tokens to vocabulary indices
        text = info["text"]
        encode_res = tokenizer.encode(
            text, pad_to_max_seq_len=False, return_attention_mask=True)

        # Decide whether to delete the special tokens according to the parameters
        if not add_special_ids:
            # TODO: use tok.all_special_ids to remove
            encode_res["input_ids"] = encode_res["input_ids"][1:-1]
            encode_res["token_type_ids"] = encode_res["token_type_ids"][1:-1]
            encode_res["attention_mask"] = encode_res["attention_mask"][1:-1]

        input_ids_list.extend(encode_res["input_ids"])
        token_type_ids_list.extend(encode_res["token_type_ids"])
        bbox_list.extend([bbox] * len(encode_res["input_ids"]))
        segment_offset_id.append(len(input_ids_list))

    encoded_inputs = {
        "input_ids": input_ids_list,
        "token_type_ids": token_type_ids_list,
        "bbox": bbox_list,
        "attention_mask": [1] * len(input_ids_list),
    }

    # Pad the values to a particular length, using 0 to fill if the length is not enough
    encoded_inputs = pad_sentences(
        tokenizer,
        encoded_inputs,
        max_seq_len=max_seq_len,
        return_attention_mask=return_attention_mask)

    # If input_ids is longer than 512, divide it into several batches
    encoded_inputs = split_page(encoded_inputs)

    fake_bs = encoded_inputs["input_ids"].shape[0]

    encoded_inputs["image"] = paddle.to_tensor(img).unsqueeze(0).expand(
        [fake_bs] + list(img.shape))
    encoded_inputs["segment_offset_id"] = segment_offset_id
    return encoded_inputs


img = cv2.imread('images/input/zh_val_42.jpg')

ocr_engine = PaddleOCR(use_angle_cls=False, show_log=False)
# Perform OCR recognition
ocr_result = ocr_engine.ocr(img, cls=False)

# Convert the OCR result into a list of dictionaries, and convert each text box to a bounding rectangle
ocr_info = parse_ocr_info_for_ser(ocr_result)

tokenizer = LayoutXLMTokenizer.from_pretrained('pretrained_models/PP-Layout_v1.0_ser_pretrained')
# Resize the image, tokenize the text and convert it to vocabulary indices, and normalize the boxes
max_seq_length = 512
inputs = preprocess(tokenizer=tokenizer, ori_img=img, ocr_info=ocr_info,
                    max_seq_len=max_seq_length, img_size=(224, 224))

print(inputs.keys())
print(inputs['image'].shape)


Corrupt JPEG data: premature end of data segment


dict_keys(['input_ids', 'token_type_ids', 'bbox', 'attention_mask', 'image', 'segment_offset_id'])
[2, 3, 224, 224]


The processed data are a dictionary that contains the following fields:


Field               Meaning
image               Image resized to 224*224
bbox                Box normalized to 0-1000
input_ids           The index of the tokenized text segment in the vocabulary
token_type_ids      Category information of the text segment
attention_mask      Masks the text segment: positions of special characters are marked 0 and positions of the
                    text segment are marked 1
segment_offset_id   The ending position of each text in input_ids
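
The leading dimension of 2 in the image shape comes from splitting a token sequence longer than 512 into fixed-length pages. The idea can be sketched in plain Python as follows. This is an illustration of padding and splitting, not the pad_sentences and split_page helpers themselves; the function name pad_and_split and the padding id of 0 are assumptions.

```python
def pad_and_split(input_ids, max_seq_len=512, pad_id=0):
    """Pad to a multiple of max_seq_len, then cut into rows of max_seq_len."""
    remainder = len(input_ids) % max_seq_len
    if remainder or not input_ids:
        input_ids = input_ids + [pad_id] * (max_seq_len - remainder)
    return [input_ids[i:i + max_seq_len]
            for i in range(0, len(input_ids), max_seq_len)]


# 700 made-up token ids need two pages of 512; the second page is mostly padding.
pages = pad_and_split(list(range(1, 701)))
print(len(pages), [len(page) for page in pages])
```

The same split has to be applied to bbox, token_type_ids and attention_mask so that all fields stay aligned, and the image is repeated once per page.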
SER Network


SER refers to Semantic Entity Recognition, which recognizes and classifies the texts in an image. The SER network
adds a fully connected classification head to the output of LayoutXLMModel. The code is as follows:


# https://github.com/PaddlePaddle/PaddleNLP/blob/develop/paddlenlp/transformers/layoutxlm/modeling.py#L846

from paddlenlp.transformers import LayoutXLMModel, LayoutXLMPretrainedModel
from paddle import nn


class LayoutXLMForTokenClassification(LayoutXLMPretrainedModel):
    def __init__(self, layoutxlm, num_classes=2, dropout=None):
        super(LayoutXLMForTokenClassification, self).__init__()
        self.num_classes = num_classes
        if isinstance(layoutxlm, dict):
            self.layoutxlm = LayoutXLMModel(**layoutxlm)
        else:
            self.layoutxlm = layoutxlm
        self.dropout = nn.Dropout(dropout if dropout is not None else
                                  self.layoutxlm.config["hidden_dropout_prob"])
        self.classifier = nn.Linear(self.layoutxlm.config["hidden_size"], num_classes)
        self.classifier.apply(self.init_weights)

    def get_input_embeddings(self):
        return self.layoutxlm.embeddings.word_embeddings

    def forward(self, input_ids=None, bbox=None, image=None, attention_mask=None,
                token_type_ids=None, position_ids=None, head_mask=None, labels=None):
        # Run the backbone
        outputs = self.layoutxlm(input_ids=input_ids, bbox=bbox, image=image,
                                 attention_mask=attention_mask, token_type_ids=token_type_ids,
                                 position_ids=position_ids, head_mask=head_mask)
        seq_length = input_ids.shape[1]
        # Run the head on the text part of the output sequence
        sequence_output, image_output = outputs[0][:, :seq_length], outputs[0][:, seq_length:]
        sequence_output = self.dropout(sequence_output)
        logits = self.classifier(sequence_output)

        outputs = logits,

        # Calculate the loss
        if labels is not None:
            loss_fct = nn.CrossEntropyLoss()
            if attention_mask is not None:
                # Only positions with attention_mask == 1 contribute to the loss
                active_loss = attention_mask.reshape([-1, ]) == 1
                active_logits = logits.reshape([-1, self.num_classes])[active_loss]
                active_labels = labels.reshape([-1, ])[active_loss]
                loss = loss_fct(active_logits, active_labels)
            else:
                loss = loss_fct(logits.reshape([-1, self.num_classes]), labels.reshape([-1, ]))
            outputs = (loss, ) + outputs

        return outputs


# Initialize the network
net = LayoutXLMForTokenClassification.from_pretrained('pretrained_models/PP-Layout_v1.0_ser_pretrained')
net.eval()
# Run the network forward pass
outputs = net(input_ids=inputs["input_ids"],
              bbox=inputs["bbox"],
              image=inputs["image"],
              token_type_ids=inputs["token_type_ids"],
              attention_mask=inputs["attention_mask"])
print(outputs[0].shape)
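
The loss branch of the forward method only counts positions whose attention mask is 1. A plain-Python sketch of such a masked mean cross-entropy is given here for illustration; the function name masked_cross_entropy and the toy numbers are assumptions, and the real computation is done by paddle's CrossEntropyLoss on tensors.

```python
import math


def masked_cross_entropy(logits, labels, attention_mask):
    """Mean cross-entropy over positions where attention_mask == 1."""
    losses = []
    for row, label, mask in zip(logits, labels, attention_mask):
        if mask != 1:
            continue  # padding positions do not contribute
        # log-sum-exp with max subtraction for numerical stability
        top = max(row)
        log_sum = top + math.log(sum(math.exp(v - top) for v in row))
        losses.append(log_sum - row[label])
    return sum(losses) / len(losses) if losses else 0.0


# Three token positions, two classes; the last position is padding.
toy_logits = [[0.0, 0.0], [5.0, -5.0], [9.0, 9.0]]
toy_labels = [0, 0, 1]
toy_mask = [1, 1, 0]
print(masked_cross_entropy(toy_logits, toy_labels, toy_mask))
```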


Post-Processing

Post-processing maps the labels that the model infers for each text segment (token) back to the OCR text lines and
merges them with the OCR result. It mainly includes the following steps:

1. For each text, count the inferred labels of all text segments under the text;

2. Select the label inferred most often among these text segments as the label of the text.
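
The two steps above amount to a majority vote per text line. A minimal sketch using collections.Counter follows; the function name vote_text_label is hypothetical, and merging the B- and I- prefixes before voting is an assumption that mirrors the label merging used for drawing.

```python
from collections import Counter


def vote_text_label(token_labels):
    """Pick the most frequent entity type among the token labels of one text."""
    # Strip the B-/I- prefixes so that B-ANSWER and I-ANSWER vote together.
    entity_types = [
        label[2:] if label[:2] in ("B-", "I-") else label
        for label in token_labels
    ]
    if not entity_types:
        return "O"  # no tokens: fall back to the background class
    return Counter(entity_types).most_common(1)[0][0]


print(vote_text_label(["B-QUESTION", "I-QUESTION", "O"]))
print(vote_text_label(["B-ANSWER", "I-ANSWER", "I-ANSWER", "B-HEADER"]))
```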


# https://github.com/PaddlePaddle/PaddleOCR/blob/release/2.4/ppstructure/vqa/vqa_utils.py

import paddle
import numpy as np

from infer_ser_e2e import get_bio_label_maps
label2id_map, id2label_map = get_bio_label_maps('labels/labels_ser.txt')


def postprocess(attention_mask, preds, id2label_map):
    if isinstance(preds, paddle.Tensor):
        preds = preds.numpy()
    preds = np.argmax(preds, axis=2)

    preds_list = [[] for _ in range(preds.shape[0])]

    # keep batch info
    for i in range(preds.shape[0]):
        for j in range(preds.shape[1]):
            if attention_mask[i][j] == 1:
                preds_list[i].append(id2label_map[preds[i][j]])

    return preds_list


def merge_preds_list_with_ocr_info(ocr_info, segment_offset_id, preds_list,
                                   label2id_map_for_draw):
    # Flatten the list
    preds = [p for pred in preds_list for p in pred]

    # Convert the label2id dictionary to an id2label dictionary, and remove the
    # B- and I- prefixes
    id2label_map = dict()
    for key in label2id_map_for_draw:
        val = label2id_map_for_draw[key]
        if key == "O":
            id2label_map[val] = key
        if key.startswith("B-") or key.startswith("I-"):
            id2label_map[val] = key[2:]
        else:
            id2label_map[val] = key
    print("id2label_map:", id2label_map)

    # Count the inferred labels of each text
    for idx in range(len(segment_offset_id)):
        if idx == 0:
            start_id = 0
        else:
            start_id = segment_offset_id[idx - 1]
        end_id = segment_offset_id[idx]

        # Take out the range of the text in the output
        curr_pred = preds[start_id:end_id]
        # Take out all the inference results of the text in the output
        curr_pred = [label2id_map_for_draw[p] for p in curr_pred]

        if len(curr_pred) <= 0:
            pred_id = 0
        else:
            # print("pred label:", curr_pred)
            # Count the labels
            counts = np.bincount(curr_pred)
            # print("counts:", counts)
            pred_id = np.argmax(counts)
        ocr_info[idx]["pred_id"] = int(pred_id)
        ocr_info[idx]["pred"] = id2label_map[int(pred_id)]
        # print("pred label:", id2label_map[int(pred_id)])
    return ocr_info


preds = postprocess(inputs["attention_mask"], outputs[0], id2label_map)

# Map each label beginning with I- to the id of the corresponding B- label
label2id_map_for_draw = dict()
for key in label2id_map:
    if key.startswith("I-"):
        label2id_map_for_draw[key] = label2id_map["B" + key[1:]]
    else:
        label2id_map_for_draw[key] = label2id_map[key]
print("label2id_map:", label2id_map)
print("label2id_map_for_draw:", label2id_map_for_draw)

# Combine the inferred labels with the OCR information
ocr_info_with_ser = merge_preds_list_with_ocr_info(ocr_info, inputs["segment_offset_id"], preds,
                                                   label2id_map_for_draw)
print(ocr_info_with_ser)


Part of the output information is as follows:


label2id_map: {'O': 0, 'B-QUESTION': 1, 'I-QUESTION': 2, 'B-ANSWER': 3, 'I-ANSWER': 4, 'B-HEADER': 5, 'I-HEADER': 6}
label2id_map_for_draw: {'O': 0, 'B-QUESTION': 1, 'I-QUESTION': 1, 'B-ANSWER': 3, 'I-ANSWER': 3, 'B-HEADER': 5, 'I-HEADER': 5}
id2label_map: {0: 'O', 1: 'QUESTION', 3: 'ANSWER', 5: 'HEADER'}
result: [
{'text': '...', 'bbox': [1026.0, 292.0, 1495.0, 377.0], 'poly': [[1027.0, 292.0], [1495.0, 300.0], [1494.0, 377.0], [1026.0, 369.0]], 'pred_id': 5, 'pred': 'HEADER'},
{'text': '...', 'bbox': [207.0, 424.0, 587.0, 475.0], 'poly': [[207.0, 424.0], [587.0, 424.0], [587.0, 475.0], [207.0, 475.0]], 'pred_id': 1, 'pred': 'QUESTION'},
{'text': '...', 'bbox': [1144.0, 526.0, 1218.0, 566.0], 'poly': [[1144.0, 526.0], [1218.0, 526.0], [1218.0, 566.0], [1144.0, 566.0]], 'pred_id': 1, 'pred': 'QUESTION'}
]


8.3.3 Training 


This section takes the XFUND Chinese dataset as an example to introduce how to train, evaluate, and test the SER
model.


Data Preparation 


Here, the XFUND dataset is used as the experimental dataset. It is a multilingual dataset proposed by Microsoft for
KIE tasks. It contains datasets in seven languages, each with 149 training samples and 50 validation samples:


• ZH (Chinese)
• JA (Japanese)
• ES (Spanish)
• FR (French)
• IT (Italian)
• DE (German)
• PT (Portuguese)


This experiment uses the Chinese dataset for demonstration and the French dataset for the practical course. Samples
of the datasets are shown in the figure below.


You can run the following commands to download and unzip the Chinese dataset, or download it from
https://github.com/doc-analysis/XFUND.


! wget https://paddleocr.bj.bcebos.com/dataset/XFUND.tar
! tar -xf XFUND.tar

# Use the following code to convert other datasets of XFUND
# https://github.com/PaddlePaddle/PaddleOCR/blob/release%2F2.4/ppstructure/vqa/helper/trans_xfun_data.py


File ‘XFUND.tar’ already there; not retrieving.


After that, there are 2 folders in /home/aistudio/PaddleOCR/ppstructure/vqa/XFUND, and the directory structure is
as follows:


/home/aistudio/PaddleOCR/ppstructure/vqa/XFUND
├── zh_train/                        training set
│   ├── image/                       image storage folder
│   └── xfun_normalize_train.json    label information
└── zh_val/                          validation set
    ├── image/                       image storage folder
    └── xfun_normalize_val.json      label information


The label format of this dataset is:


{
    "height": 3508,    # Image height
    "width": 2480,     # Image width
    "ocr_info": [
        {
            "text": "...",                   # Text content
            "label": "question",             # The category of the text
            "bbox": [261, 802, 483, 859],    # Text box
            "id": 54,                        # Text index
            "linking": [[54, 60]],           # The relationship between the current text and other texts [question, answer]
            "words": []
        },
        {
            "text": "...",
            "label": "answer",
            "bbox": [487, 810, 862, 859],
            "id": 60,
            "linking": [[54, 60]],
            "words": []
        }
    ]
}

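A label in this format can be read with the standard json module. The sketch below collects the question and answer texts that are connected through the linking field; the helper name linked_pairs and the placeholder text values are assumptions for illustration, while the field names, ids, and boxes follow the format above.

```python
import json

# A label in the format above, with placeholder text values.
sample = json.loads("""
{
  "height": 3508,
  "width": 2480,
  "ocr_info": [
    {"text": "question text", "label": "question", "bbox": [261, 802, 483, 859],
     "id": 54, "linking": [[54, 60]], "words": []},
    {"text": "answer text", "label": "answer", "bbox": [487, 810, 862, 859],
     "id": 60, "linking": [[54, 60]], "words": []}
  ]
}
""")


def linked_pairs(label):
    """Return (question_text, answer_text) for every distinct link."""
    id_to_text = {item["id"]: item["text"] for item in label["ocr_info"]}
    # Each link is listed on both of its endpoints, so deduplicate with a set.
    links = {tuple(link) for item in label["ocr_info"] for link in item["linking"]}
    return [(id_to_text[q_id], id_to_text[a_id]) for q_id, a_id in sorted(links)]


print(linked_pairs(sample))
```

For SER only the text, label and bbox fields are needed; the linking field becomes relevant for the RE task.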
Loss Function 


Since it is a classification problem, choose CrossEntropyLoss as the loss function 


Model Training 


After processing the data and defining the loss function, you can start to train the model.


The commands are as follows: 


python train_ser.py \
    --model_name_or_path "layoutxlm-base-uncased" \
    --ser_model_type "LayoutXLM" \
    --train_data_dir "XFUND/zh_train/image" \
    --train_label_path "XFUND/zh_train/xfun_normalize_train.json" \
    --eval_data_dir "XFUND/zh_val/image" \
    --eval_label_path "XFUND/zh_val/xfun_normalize_val.json" \
    --per_gpu_train_batch_size 8 \
    --per_gpu_eval_batch_size 8 \
    --num_train_epochs 200 \
    --eval_steps 10 \
    --output_dir "./output/ser/" \
    --learning_rate 5e-5 \
    --warmup_steps 50 \
    --evaluate_during_training \
    --num_workers 0 \
    --seed 2048


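During the training process, the following log will be output:


[2021/12/26 20:12:07] root INFO: ----------- Configuration Arguments -----------
[2021/12/26 20:12:07] root INFO: adam_epsilon: 1e-08
[2021/12/26 20:12:07] root INFO: det_model_dir: None
[2021/12/26 20:12:07] root INFO: eval_data_dir: XFUND/zh_val/image
[2021/12/26 20:12:07] root INFO: eval_label_path: XFUND/zh_val/xfun_normalize_val.json
[2021/12/26 20:12:07] root INFO: eval_steps: 10
[2021/12/26 20:12:07] root INFO: evaluate_during_training: True
[2021/12/26 20:12:07] root INFO: infer_imgs: None
[2021/12/26 20:12:07] root INFO: label_map_path: ./labels/labels_ser.txt
[2021/12/26 20:12:07] root INFO: learning_rate: 5e-05
[2021/12/26 20:12:07] root INFO: max_grad_norm: 1.0
[2021/12/26 20:12:07] root INFO: max_seq_length: 512
[2021/12/26 20:12:07] root INFO: model_name_or_path: layoutxlm-base-uncased
[2021/12/26 20:12:07] root INFO: num_train_epochs: 200
[2021/12/26 20:12:07] root INFO: num_workers: 0
[2021/12/26 20:12:07] root INFO: ocr_json_path: None
[2021/12/26 20:12:07] root INFO: output_dir: ./output/ser/
[2021/12/26 20:12:07] root INFO: per_gpu_eval_batch_size: 8
[2021/12/26 20:12:07] root INFO: per_gpu_train_batch_size: 8
[2021/12/26 20:12:07] root INFO: re_model_name_or_path: None
[2021/12/26 20:12:07] root INFO: rec_model_dir: None
[2021/12/26 20:12:07] root INFO: resume: False
[2021/12/26 20:12:07] root INFO: seed: 2048
[2021/12/26 20:12:07] root INFO: ser_model_type: LayoutXLM
[2021/12/26 20:12:07] root INFO: train_data_dir: XFUND/zh_train/image
[2021/12/26 20:12:07] root INFO: train_label_path: XFUND/zh_train/xfun_normalize_train.json
[2021/12/26 20:12:07] root INFO: warmup_steps: 50
[2021/12/26 20:12:07] root INFO: weight_decay: 0.0
[INFO] - Already cached /home/aistudio/.paddlenlp/models/layoutxlm-base-uncased/sentencepiece.bpe.model
[INFO] - Already cached /home/aistudio/.paddlenlp/models/layoutxlm-base-uncased/model_state.pdparams
W1226 20:12:07.929606 1085 device_context.cc:447] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 10.1, Runtime API Version: 10.1
W1226 20:12:07.933472 1085 device_context.cc:465] device: 0, cuDNN Version: 7.6.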
[2021/12/26 20:12:18] root INFO: train from scratch 
[2021/12/26 20:12:18] root INFO: ***** Running training ***** 
[2021/12/26 20:12:18] root INFO: Num examples = 149 
[2021/12/26 20:12:18] root INFO: Num Epochs = 200 
[2021/12/26 20:12:18] root INFO: Instantaneous batch size per GPU = 8 
[2021/12/26 20:12:18] root INFO: Total train batch size (w. parallel, distributed) = 8
[2021/12/26 20:12:18] root INFO: Total optimization steps = 3800
[2021/12/26 20:12:20] root INFO: epoch: [0/200], iter: [0/19], global_step:1, train loss: 1.983819, lr: 0.000001, avg_reader_cost: 1.32728 sec, avg_batch_cost: 1.49863 sec, avg_samples: 8.00000, ips: 5.33822 images/sec
[2021/12/26 20:12:21] root INFO: epoch: [0/200], iter: [1/19], global_step:2, train loss: 1.935008, lr: 0.000002, avg_reader_cost: 0.61179 sec, avg_batch_cost: 0.72010 sec, avg_samples: 8.00000, ips: 11.10955 images/sec
[2021/12/26 20:12:23] root INFO: epoch: [0/200], iter: [2/19], global_step:3, train loss: 1.957709, lr: 0.000003, avg_reader_cost: 0.75516 sec, avg_batch_cost: 0.85305 sec, avg_samples: 8.00000, ips: 9.37815 images/sec
Corrupt JPEG data: 18 extraneous bytes before marker 0xc4
[2021/12/26 20:12:24] root INFO: epoch: [0/200], iter: [3/19], global_step:4, train loss: 1.842568, lr: 0.000004, avg_reader_cost: 0.76927 sec, avg_batch_cost: 0.86650 sec, avg_samples: 8.00000, ips: 9.23258 images/sec
[2021/12/26 20:12:25] root INFO: epoch: [0/200], iter: [4/19], global_step:5, train loss: 1.941558, lr: 0.000005, avg_reader_cost: 0.67992 sec, avg_batch_cost: 0.77854 sec, avg_samples: 8.00000, ips: 10.27559 images/sec
[2021/12/26 20:12:26] root INFO: epoch: [0/200], iter: [5/19], global_step:6, train loss: 1.879326, lr: 0.000006, avg_reader_cost: 0.62112 sec, avg_batch_cost: 0.71867 sec, avg_samples: 8.00000, ips: 11.13167 images/sec
[2021/12/26 20:12:27] root INFO: epoch: [0/200], iter: [6/19], global_step:7, train loss: 1.833748, lr: 0.000007, avg_reader_cost: 0.79442 sec, avg_batch_cost: 0.89132 sec, avg_samples: 8.00000, ips: 8.97544 images/sec
[2021/12/26 20:12:29] root INFO: epoch: [0/200], iter: [7/19], global_step:8, train loss: 1.747398, lr: 0.000008, avg_reader_cost: 0.74634 sec, avg_batch_cost: 0.84421 sec, avg_samples: 8.00000, ips: 9.47633 images/sec
[2021/12/26 20:12:30] root INFO: epoch: [0/200], iter: [8/19], global_step:9, train loss: 1.603032, lr: 0.000009, avg_reader_cost: 0.79887 sec, avg_batch_cost: 0.89827 sec, avg_samples: 8.00000, ips: 8.90600 images/sec
[2021/12/26 20:12:31] root INFO: epoch: [0/200], iter: [9/19], global_step:10, train loss: 1.678029, lr: 0.000010, avg_reader_cost: 0.78243 sec, avg_batch_cost: 0.88950 sec, avg_samples: 8.00000, ips: 8.99385 images/sec
[2021/12/26 20:12:33] root INFO: [Eval]process: 0/7, loss: 1.41839
[2021/12/26 20:12:34] root INFO: [Eval]process: 1/7, loss: 1.60403
[2021/12/26 20:12:35] root INFO: [Eval]process: 2/7, loss: 1.70345
[2021/12/26 20:12:36] root INFO: [Eval]process: 3/7, loss: 1.60751
[2021/12/26 20:12:38] root INFO: [Eval]process: 4/7, loss: 1.49639
Corrupt JPEG data: premature end of data segment
[2021/12/26 20:12:39] root INFO: [Eval]process: 5/7, loss: 1.66062
[2021/12/26 20:12:39] root INFO: [Eval]process: 6/7, loss: 1.56035
[2021/12/26 20:12:40] root INFO:
              precision    recall  f1-score   support
      ANSWER       0.01      0.01      0.01      1514
      HEADER       0.00      0.00      0.00        58
    QUESTION       0.03      0.02      0.02      1155
   micro avg       0.02      0.01      0.01      2727
   macro avg       0.01      0.01      0.01      2727
weighted avg       0.02      0.01      0.01      2727
[2021/12/26 20:12:40] root INFO: ***** Eval results *****
[2021/12/26 20:12:40] root INFO: f1 = 0.013078227173649792
[2021/12/26 20:12:40] root INFO: loss = 1.5786780970437186
[2021/12/26 20:12:40] root INFO: precision = 0.01925820256776034
[2021/12/26 20:12:40] root INFO: recall = 0.009900990099009901
[2021/12/26 20:12:44] root INFO: Saving model checkpoint to ./output/ser/best_model
[2021/12/26 20:12:44] root INFO: [epoch 0/200] [iter: 9/19] results: {'loss': 1.5786780970437186, 'precision': 0.01925820256776034, 'recall': 0.009900990099009901, 'f1': 0.013078227173649792}
[2021/12/26 20:12:44] root INFO: best metrics: {'loss': 1.5786780970437186, 'precision': 0.01925820256776034, 'recall': 0.009900990099009901, 'f1': 0.013078227173649792}
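The step counts and the first learning rates in this log can be checked by hand. The sketch below assumes a linear warmup from 0 to the base learning rate over warmup_steps; this is consistent with the logged values but is inferred from the log, not taken from the training script.

```python
import math

num_examples = 149      # "Num examples" in the log
batch_size = 8          # --per_gpu_train_batch_size, single GPU
num_epochs = 200        # --num_train_epochs
base_lr = 5e-5          # --learning_rate
warmup_steps = 50       # --warmup_steps

# The last, smaller batch still counts as one iteration.
iters_per_epoch = math.ceil(num_examples / batch_size)
total_steps = iters_per_epoch * num_epochs


def warmup_lr(global_step):
    """Linear warmup (assumed): the rate grows proportionally to the step.

    What happens after the warmup phase (e.g. decay) is not modelled here.
    """
    return base_lr * min(global_step, warmup_steps) / warmup_steps


print(iters_per_epoch, total_steps)  # 19 3800
print([round(warmup_lr(step), 6) for step in (1, 2, 10)])
```

Under this assumption the learning rate increases by 0.000001 per step, which matches the lr column of the first ten iterations.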


Model Evaluation 


During the training process, two models are saved by default: the most recently trained model, saved as latest_model, and
the most accurate model, saved as best_model. The folder structure for saving the models is as follows:


output/ser/
├── best_model
│   ├── model_config.json          # Model configuration
│   ├── model_state.pdparams       # Model parameters
│   ├── sentencepiece.bpe.model    # Parameters of tokenizer
│   ├── tokenizer_config.json      # tokenizer configuration
│   └── training_args.bin          # Parameters for starting training
├── infer_results.txt
├── latest_model
│   ├── model_config.json
│   ├── model_state.pdparams
│   ├── sentencepiece.bpe.model
│   ├── tokenizer_config.json
│   └── training_args.bin
├── test_gt.txt
├── test_pred.txt
└── train.log                      # Training log


Next, use the saved model parameters to evaluate the accuracy on the test set: 


! python eval_ser.py \
    --model_name_or_path "output/ser/best_model" \
    --ser_model_type "LayoutXLM" \
    --eval_data_dir "XFUND/zh_val/image" \
    --eval_label_path "XFUND/zh_val/xfun_normalize_val.json" \
    --per_gpu_eval_batch_size 8 \
    --num_workers 8 \
    --output_dir "output/ser/" \
    --seed 2048


The following log will be output during the evaluation process:


[2021/12/26 20:13:05] root INFO: ----------- Configuration Arguments -----------
[2021/12/26 20:13:05] root INFO: adam_epsilon: 1e-08
[2021/12/26 20:13:05] root INFO: det_model_dir: None
[2021/12/26 20:13:05] root INFO: eval_data_dir: XFUND/zh_val/image
[2021/12/26 20:13:05] root INFO: eval_label_path: XFUND/zh_val/xfun_normalize_val.json
[2021/12/26 20:13:05] root INFO: eval_steps: 10
[2021/12/26 20:13:05] root INFO: evaluate_during_training: False
[2021/12/26 20:13:05] root INFO: infer_imgs: None
[2021/12/26 20:13:05] root INFO: label_map_path: ./labels/labels_ser.txt
[2021/12/26 20:13:05] root INFO: learning_rate: 5e-05
[2021/12/26 20:13:05] root INFO: max_grad_norm: 1.0
[2021/12/26 20:13:05] root INFO: max_seq_length: 512
[2021/12/26 20:13:05] root INFO: model_name_or_path: output/ser/best_model
[2021/12/26 20:13:05] root INFO: num_train_epochs: 3
[2021/12/26 20:13:05] root INFO: num_workers: 8
[2021/12/26 20:13:05] root INFO: ocr_json_path: None
[2021/12/26 20:13:05] root INFO: output_dir: output/ser/
[2021/12/26 20:13:05] root INFO: per_gpu_eval_batch_size: 8
[2021/12/26 20:13:05] root INFO: per_gpu_train_batch_size: 8
[2021/12/26 20:13:05] root INFO: re_model_name_or_path: None
[2021/12/26 20:13:05] root INFO: rec_model_dir: None
[2021/12/26 20:13:05] root INFO: resume: False
[2021/12/26 20:13:05] root INFO: seed: 2048
[2021/12/26 20:13:05] root INFO: ser_model_type: LayoutXLM
[2021/12/26 20:13:05] root INFO: train_data_dir: None
[2021/12/26 20:13:05] root INFO: train_label_path: None
[2021/12/26 20:13:05] root INFO: warmup_steps: 0
[2021/12/26 20:13:05] root INFO: weight_decay: 0.0
[2021/12/26 20:13:05] root INFO: ------------------------------------------------
W1226 20:13:05.816488 1230 device_context.cc:447] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 10.1, Runtime API Version: 10.1
W1226 20:13:05.820412 1230 device_context.cc:465] device: 0, cuDNN Version: 7.6.
Corrupt JPEG data: premature end of data segment
[2021/12/26 20:13:18] root INFO: [Eval]process: 0/7, loss: 1.41839
[2021/12/26 20:13:18] root INFO: [Eval]process: 1/7, loss: 1.60403
[2021/12/26 20:13:19] root INFO: [Eval]process: 2/7, loss: 1.70345
[2021/12/26 20:13:19] root INFO: [Eval]process: 3/7, loss: 1.60751
[2021/12/26 20:13:19] root INFO: [Eval]process: 4/7, loss: 1.49639
[2021/12/26 20:13:19] root INFO: [Eval]process: 5/7, loss: 1.66062
[2021/12/26 20:13:19] root INFO: [Eval]process: 6/7, loss: 1.56035
[2021/12/26 20:13:20] root INFO:
              precision    recall  f1-score   support
      ANSWER       0.01      0.01      0.01      1514
      HEADER       0.00      0.00      0.00        58
    QUESTION       0.03      0.02      0.02      1155
   micro avg       0.02      0.01      0.01      2727
   macro avg       0.01      0.01      0.01      2727
weighted avg       0.02      0.01      0.01      2727
[2021/12/26 20:13:20] root INFO: ***** Eval results *****
[2021/12/26 20:13:20] root INFO: f1 = 0.013078227173649792
[2021/12/26 20:13:20] root INFO: loss = 1.5786780970437186
[2021/12/26 20:13:20] root INFO: precision = 0.01925820256776034
[2021/12/26 20:13:20] root INFO: recall = 0.009900990099009901
[2021/12/26 20:13:20] root INFO: {'loss': 1.5786780970437186, 'precision': 0.01925820256776034, 'recall': 0.009900990099009901, 'f1': 0.013078227173649792}
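The f1 value reported in this log is the harmonic mean of the logged precision and recall, which can be verified with a few lines of Python. The two input numbers are copied from the log; nothing else is assumed.

```python
precision = 0.01925820256776034
recall = 0.009900990099009901

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
print(f1)
```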
Model Inference


After training the model, you can also use the saved model to perform inference on a single image or on the images in
a folder, and observe the inference results of the model.


! python infer_ser_e2e.py \
    --model_name_or_path "./pretrained_models/PP-Layout_v1.0_ser_pretrained/" \
    --ser_model_type "LayoutXLM" \
    --max_seq_length 512 \
    --output_dir "output/ser_e2e/" \
    --infer_imgs "images/input/zh_val_42.jpg"


During the inference process, the following log will be output:


process: [0/1], save result to output/ser_e2e/zh_val_42_ser.jpg 


8.3.4 Assignment 


Experiment 


https://aistudio.baidu.com/aistudio/projectdetail/3281385


END-TO-END ALGORITHM


9.1 Background 


OCR algorithms fall mainly into two types: two-stage algorithms and end-to-end algorithms. Two-stage OCR
algorithms include two tasks: text detection and text recognition. The text detection algorithm locates the text regions of
the input image, and the text recognition algorithm recognizes the text content in those regions. Basically, end-to-end OCR
algorithms aim to integrate detection and recognition in a unified framework. The two parts share the same backbone
network but have specialized modules for detection and recognition, so that they can be trained at the same time. Because
the end-to-end algorithm simplifies the pipeline, the model is typically smaller and the processing speed faster.


In this section, we introduce some end-to-end text recognition methods based on deep learning from recent years. These
approaches can be broadly classified into two categories:


1) End-to-end regular text recognition.


2) End-to-end arbitrary-shaped text recognition.


End-to-end regular text recognition algorithms mainly address the detection and recognition of horizontal or
multi-directional text. However, natural scenes contain a large amount of curved and distorted text, such as seals.
Detecting and recognizing such text requires end-to-end arbitrary-shaped text recognition algorithms, which can also
handle regular text.


This section selects some representative end-to-end recognition methods from 2017 to 2021, and their classification
according to the two categories above is shown in the table below.


Table 1 End-to-end recognition methods


Category | Main papers
End-to-end regular text recognition methods | FOTS, TextSpotter
End-to-end arbitrary-shaped text recognition methods | Mask TextSpotter V1, Mask TextSpotter V2, Mask TextSpotter V3, TextDragon, CharNet, TUTS, ABCNet, ABCNet v2, Text Perceptron, PGNet, PAN++


Note: If anything is missing from the table, or if you have any questions, please contact us at link.


9.2 Algorithms


9.2.1 End-to-end Regular Text Recognition Algorithms 


Xuebo Liu and Ding Liang et al. (2018) [1] propose a unified end-to-end trainable network FOTS (Fast Oriented Text
Spotting) for simultaneous detection and recognition. Its network structure is shown in Figure 1. It consists of four parts:
shared convolutions, the text detection branch, the RoIRotate operation and the text recognition branch.


1) Use shared convolutions to extract feature maps. By sharing convolutional features, the network can detect and
recognize text simultaneously with little computation overhead, which leads to real-time speed;


2) Construct a text detection branch based on the fully convolutional network (FCN) after the feature maps are extracted,
in order to predict the detection bounding boxes;


3) The RoIRotate operator extracts the oriented text regions from convolutional feature maps. This operation unifies text
detection and recognition into an end-to-end pipeline;


4) Finally, the text proposal features are fed into a Recurrent Neural Network (RNN) encoder and a Connectionist Temporal
Classification (CTC) decoder for text recognition.


Since all the modules in the network are differentiable, the whole system can be trained end-to-end, which does not require 
complicated post-processing and hyperparameter tuning. 
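To illustrate what the CTC decoder in the recognition branch does at inference time, here is a minimal greedy decoding sketch: take the best class per frame, merge consecutive repeats, and drop the blank symbol. The blank index 0, the toy alphabet and the per-frame scores are assumptions made for this example; FOTS itself is not reproduced here.

```python
def ctc_greedy_decode(frame_scores, alphabet, blank=0):
    """Greedy CTC decoding: best class per frame, merge repeats, drop blanks."""
    best_path = [max(range(len(scores)), key=scores.__getitem__)
                 for scores in frame_scores]
    decoded = []
    previous = blank
    for index in best_path:
        # A symbol is emitted only when it differs from the previous frame
        # and is not the blank symbol.
        if index != previous and index != blank:
            decoded.append(alphabet[index])
        previous = index
    return "".join(decoded)


# Index 0 is the blank; the frames read: t, t, blank, e, blank, x, x, t
alphabet = ["-", "t", "e", "x"]
frames = [
    [0.1, 0.8, 0.05, 0.05],
    [0.1, 0.7, 0.1, 0.1],
    [0.9, 0.05, 0.03, 0.02],
    [0.1, 0.1, 0.7, 0.1],
    [0.8, 0.1, 0.05, 0.05],
    [0.1, 0.1, 0.1, 0.7],
    [0.2, 0.1, 0.1, 0.6],
    [0.1, 0.6, 0.2, 0.1],
]
text = ctc_greedy_decode(frames, alphabet)
print(text)  # text
```

The blank symbol is what allows repeated characters: two identical letters are kept separate only if a blank frame lies between them.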




Fig. 1 Overall structure of the FOTS model 


To validate FOTS, the authors conducted experiments on the ICDAR 2015, ICDAR 2017 MLT and ICDAR 2013
datasets, and the visualized results are shown in Figure 2.


(a) ICDAR 2015 (b) ICDAR 2017 MLT (c) ICDAR 2013


Fig. 2 The results of FOTS


Note: As ICDAR 2017 MLT does not involve text recognition, only the text detection results are shown.


Tong He et al. (2018) [2] propose a simple but effective model, TextSpotter, whose overall architecture is presented in
Figure 3. The detection model is a fully convolutional architecture built on the PVANet framework, to which a new
recurrent branch is added for word recognition. TextSpotter can perform detection and recognition in one shot. The RNN
branch is composed of a new text-alignment layer and an LSTM-based recurrent module with a novel character attention
embedding mechanism. The text-alignment layer uses a grid sampling scheme instead of conventional RoI pooling; it
computes fixed-length convolutional features that precisely align to a detected text region of arbitrary orientation. The
character attention mechanism uses character spatial information as additional supervision, which makes the RNN focus
on the current attentional features. Finally, TextSpotter adopts a learning strategy that allows text detection and
recognition to work collaboratively by sharing convolutional features.




Fig. 3 The overall structure of the TextSpotter model 


To validate TextSpotter, the authors experimented on the ICDAR2013 and ICDAR2015 datasets, and the visualized
results are shown in Figure 4.


Fig. 4 The results of TextSpotter


9.2.2 End-to-end Arbitrary-shaped Text Recognition Algorithms 


Pengyuan Lyu and Minghui Liao et al. (2018) [3] propose Mask TextSpotter, an end-to-end trainable neural network
model for scene text spotting. It can detect and recognize text instances of arbitrary shapes (horizontal, oriented, or
curved; right in Figure 5), unlike some methods that can only detect and recognize horizontal text (left in Figure 5) or
oriented text (middle in Figure 5).




Fig. 5 The comparison of available text types in recognition 


Inspired by Mask R-CNN, Mask TextSpotter detects text by segmenting the text instance regions, so the detector is able to
detect text of arbitrary shapes. The network structure of Mask TextSpotter is shown in Figure 6.


Fig. 6 Overall structure of the Mask TextSpotter model


To verify the effectiveness of Mask TextSpotter, the authors conducted experiments on the ICDAR2013,
ICDAR2015 and Total-Text datasets, and the visualized results are shown in Figure 7.


Fig. 7 The results of Mask TextSpotter 


Minghui Liao et al. (2019) [4] propose Mask TextSpotter V2 based on Mask TextSpotter. Mask TextSpotter recognizes
single characters, requires character positions during training, and needs a post-processing algorithm to yield the text
sequence. To overcome these limitations, Mask TextSpotter V2 applies a Spatial Attention Module (SAM) in the
recognition part, which can globally predict the label sequence of each word with a spatial attention mechanism. SAM
requires only word-level annotations for training, significantly reducing the need for character-level annotations. The
network structure of Mask TextSpotter V2 is shown in Figure 8.


Fig. 8 Overall structure of the Mask TextSpotter V2 model


To verify the effectiveness of Mask TextSpotter V2, the authors conducted experiments on five datasets: ICDAR 2013,
ICDAR 2015, COCO-Text, Total-Text, and MLT. The visualized results on the ICDAR 2013 (first column),
ICDAR 2015 (second column) and Total-Text (last two columns) datasets are shown in Figure 9.


Fig. 9 The results of Mask TextSpotter V2


Minghui Liao et al. (2020) [5] also propose Mask TextSpotter V3 based on Mask TextSpotter V1 and V2. In the previous
versions, the text detection module is based on Mask R-CNN and fails to detect long text lines due to the limitations of
the RPN. Instead of the RPN, the authors use a Segmentation Proposal Network (SPN) in V3, which is superior to the RPN
in detecting text instances of extreme aspect ratios or irregular shapes. The authors also propose hard RoI masking, which
can effectively suppress neighboring text instances or background noise. The network structure of Mask TextSpotter V3 is
shown in Figure 10.
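Hard RoI masking can be pictured as an element-wise multiplication of the RoI features with a binary mask of the text instance, so that everything outside the instance becomes zero. The toy single-channel feature map, the mask values and the function name hard_roi_mask below are illustrative assumptions, not code from the paper.

```python
def hard_roi_mask(roi_features, binary_mask):
    """Zero out feature values that fall outside the text instance mask."""
    return [
        [value * keep for value, keep in zip(feature_row, mask_row)]
        for feature_row, mask_row in zip(roi_features, binary_mask)
    ]


# One 3x4 feature channel cropped by an axis-aligned RoI.
roi_features = [
    [0.5, 0.9, 0.8, 0.4],
    [0.7, 1.0, 0.9, 0.6],
    [0.3, 0.2, 0.8, 0.7],
]
# 1 inside the text polygon, 0 for neighbouring text or background.
binary_mask = [
    [0, 1, 1, 0],
    [0, 1, 1, 1],
    [0, 0, 1, 1],
]
masked = hard_roi_mask(roi_features, binary_mask)
print(masked)
```

Because the mask is binary, features of a neighbouring instance that happen to fall inside the same rectangular RoI are removed completely rather than merely down-weighted.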
Fig. 10 Overall structure of the Mask TextSpotter V3 model




To verify the effectiveness of Mask TextSpotter V3, the authors conducted experiments on the Rotated ICDAR
2013, Total-Text, and MSRA-TD500 datasets. The visualized results on the Total-Text dataset are shown in Figure 11.


Fig. 11 The results of Mask TextSpotter V3 


Wei Feng et al. (2019) [6] propose a novel text spotting framework, TextDragon, which uses only word/line-level
annotations for training; its network structure is shown in Figure 12. In TextDragon, a text detector is designed to describe
the shape of text with a series of quadrangles, which can handle text of arbitrary shapes. To extract arbitrary text
regions from feature maps, TextDragon introduces a new differentiable operator named RoISlide, which is the key to
connecting arbitrary-shaped text detection and recognition. Based on the features extracted through RoISlide, a CNN- and
CTC-based text recognizer is introduced, which frees the framework from labeling the locations of characters. The
model is therefore more practical, since it needs only word/line-level annotations instead of character positions.


Fig. 12 Overall structure of the TextDragon model


To validate the effectiveness of TextDragon, the authors conducted experiments on two curved text benchmark
datasets, CTW1500 and Total-Text, as well as the ICDAR 2015 dataset. The visualized results are shown in Figure 13.


Fig. 13 The results of TextDragon


Linjie Xing et al. (2019) [7] propose Convolutional Character Networks (CharNet) for end-to-end text detection and
recognition; the network structure is shown in Figure 14. CharNet introduces a new branch for direct character
detection and recognition, which can be integrated into existing text detection frameworks. The authors use
characters as the basic unit, which overcomes the limitations of RoI pooling and the RNN recognition module, both
of which are major limitations of current two-stage frameworks. In addition, an iterative character detection method is
proposed in the paper, which allows CharNet to transfer the character detection capabilities learned from synthetic data
to real-world images. This makes it possible to train CharNet on real-world images without additional annotations of
character-level bounding boxes.


Fig. 14 Overall structure of the CharNet model


To verify the effectiveness of the CharNet method, the authors conducted experiments on the benchmark datasets
ICDAR 2015, Total-Text and ICDAR MLT 2017, and the visualized results are shown in Figure 15.


Fig. 15 The results of CharNet


Siyang Qin et al. (2019) [8] propose an end-to-end trainable model, TUTS, whose network structure is shown in Figure
16. TUTS is a simple and flexible OCR model based on a Mask R-CNN detector and a sequence-to-sequence (seq2seq)
attention decoder. The method can detect and recognize text of arbitrary shapes, and it proposes a simple and effective
RoI masking step that aims to obtain useful features of irregularly shaped text instances from image-scale features. In
addition, partially labeled data is automatically labeled by an existing multi-stage OCR engine and used for training,
which can greatly improve the detection and recognition results of the model.


Fig. 16 Overall structure of the TUTS model


To verify the effectiveness of the TUTS method, the authors conducted experiments on the benchmark
datasets ICDAR2015 and Total-Text, and the visualized results are illustrated in Figure 17, where the two left columns
are the ICDAR2015 results and the two right columns are the Total-Text results.


Fig. 17 The results of TUTS


The blue parts in the bottom right image indicate inference errors.


Yuliang Liu et al. (2020) [9] propose an end-to-end trainable model, Adaptive Bezier-Curve Network (ABCNet), for
arbitrary-shaped text; its network structure is shown in Figure 18. For the first time, the authors introduce a concise
parametric representation of curved scene text using Bezier curves. They also design a novel BezierAlign layer for
extracting accurate convolutional features of a text instance with an arbitrary shape. The computation overhead of the
proposed Bezier curve detection method is negligible compared with previous standard detection methods. Finally,
by sharing backbone features, the recognition branch can be designed as a lightweight structure. ABCNet consists of
two parts: 1) Bezier curve detection; 2) BezierAlign and the recognition branch. This network structure has advantages in
both efficiency and accuracy.
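To show why a Bezier curve is a compact description of a curved text boundary, the sketch below evaluates a curve from its control points with Bernstein polynomials. The four control points are made up for illustration, and the function name bezier_point is hypothetical; this is the general Bezier formula, not the ABCNet implementation.

```python
from math import comb


def bezier_point(control_points, t):
    """Evaluate a Bezier curve at t in [0, 1] using Bernstein polynomials."""
    n = len(control_points) - 1
    x = y = 0.0
    for i, (px, py) in enumerate(control_points):
        # Bernstein basis polynomial B(i, n)(t) weights the i-th control point.
        weight = comb(n, i) * (t ** i) * ((1 - t) ** (n - i))
        x += weight * px
        y += weight * py
    return (x, y)


# Four control points define one cubic curve, e.g. the upper text boundary.
top_edge = [(0.0, 0.0), (1.0, 2.0), (3.0, 2.0), (4.0, 0.0)]
samples = [bezier_point(top_edge, k / 4) for k in range(5)]
print(samples)
```

A whole curved boundary is thus described by a handful of control points, and any number of boundary points can be sampled from them when features need to be aligned.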
Fig. 18 Overall structure of the ABCNet model


To verify the effectiveness of the ABCNet method, the authors conducted experiments on the arbitrary-shaped
benchmark datasets Total-Text and CTW1500, and the visualized results on the Total-Text dataset are shown in Figure 19.


Fig. 19 The results of ABCNet


Yuliang Liu et al. (2021) [10] propose ABCNet v2 based on ABCNet, with improvements in four aspects: the feature
extractor, the detection branch, the recognition branch and the end-to-end training, which together achieve state-of-the-art
performance while maintaining very high efficiency. The detection model of ABCNet v2 is more general for processing
multi-scale text instances because it considers bidirectional multi-scale pyramidal global text features. Observing that the
feature alignment in the detection branch is essential for the subsequent text recognition, the authors adopt a coordinate
encoding approach with negligible computation overhead to explicitly encode the position in the convolutional filters,
leading to a considerable improvement in accuracy. In the recognition branch, a character attention module is integrated,
which can recursively predict the characters of each word without using character-level annotations. Finally, to achieve
effective end-to-end training, the authors further propose an Adaptive End-to-End Training (AET) strategy to match the
detection for end-to-end training. The network structure of ABCNet v2 is shown in Figure 20.


Fig. 20 Overall structure of the ABCNet v2 model 


To validate the ABCNet v2 method, the authors experimented on the ICDAR 2015, MSRA-TD500, ReCTS,
Total-Text, and SCUT-CTW1500 datasets, and some visualized results are shown in Figure 21.


Fig. 21 The results of ABCNet v2


Liang Qiao et al. (2020) [11] propose an end-to-end trainable arbitrary-shaped text recognition method, Text Perceptron,
whose overall architecture is shown in Figure 22. This model consists of three parts. 1) An efficient segmentation-based
detection module that uses ResNet and FPN as the backbone network and describes a text region as four subregions: the
center region, the head, the tail, and the top&bottom boundary regions. The boundary information not only helps separate
text regions that are very close to each other, but also helps capture latent reading orders. 2) A novel Shape Transform
Module (STM), designed to transform the detected feature regions into regular morphologies without extra parameters,
which integrates irregular text detection and recognition into an end-to-end trainable model. 3) A sequence-based
recognition module for generating the final character sequences.
Fig. 22 Overall structure of the Text Perceptron model


To validate the Text Perceptron method, the authors conducted experiments on SynthText 800k, ICDAR2013,
ICDAR2015, Total-Text and CTW1500. The visualized results on the Total-Text and CTW1500 datasets are shown in
Figure 23.


Fig. 23 The results of Text Perceptron


Pengfei Wang et al. (2021) [12] propose a novel fully convolutional Point Gathering Network (PGNet), whose overall
architecture is shown in Figure 24. After feature extraction, the features of the input image are fed into four branches:
TBO (Text Border Offset), TCL (Text Center Line), TDO (Text Direction Offset), and TCC (Text Character
Classification) map. The outputs of TBO and TCL generate the text detection result after post-processing, while TCL,
TDO, and TCC are used for text recognition. PGNet has several advantages: it designs a PGNet loss to guide training
without character-level annotations, it speeds up inference by avoiding NMS and RoI operations, it proposes a module to
infer the reading order within text lines, and it puts forward a graph refinement module (GRM) to improve recognition.
As a result, the algorithm is reported to be both more accurate and faster at inference.
Fig. 24 Overall structure of the PGNet model


To verify the effectiveness of the PGNet method, the authors conducted experiments on the ICDAR2015 and Total-Text
datasets, and the visualized results are shown in Figure 25, where the left two columns are the results on Total-Text and
the right two columns are the results on ICDAR2015.


Fig. 25 The results of PGNet


Wenhai Wang et al. (2021) [13] propose an end-to-end text recognition algorithm, PAN++, which can effectively detect
and recognize arbitrary-shaped text in natural scenes; its overall architecture is shown in Figure 26. PAN++ is based
on the kernel representation, which reformulates a text line as a text kernel (central region) surrounded by peripheral
pixels and can distinguish adjacent text well. Furthermore, as a pixel-based representation, the kernel representation can
be predicted by a single fully convolutional network, which is very friendly to real-time applications. Taking advantage of
the kernel representation, the authors design a series of components: 1) an efficient feature enhancement network
composed of stacked Feature Pyramid Enhancement Modules (FPEMs); 2) a lightweight detection head cooperating with
Pixel Aggregation (PA); 3) an efficient attention-based recognition head with Masked RoI.


Fig. 26 Overall structure of the PAN++ model


To validate the PAN++ method, the authors experimented on the Total-Text, CTW1500, ICDAR2015, MSRA-TD500,
and RCTW-17 datasets, and the visualized results are shown in Figure 27.


Fig. 27 The results of PAN++


9.3 Summary 


OCR text detection and recognition is a basic task in many applications such as text retrieval and office automation. Most
existing work treats text detection and recognition as two separate tasks. No features are shared between the two
tasks, and training two models is time-consuming. In recent years, some methods have been devoted to end-to-end text
recognition, which aims to detect and recognize text simultaneously in one network. We summarize some end-to-end
text recognition methods based on deep learning, which can be roughly divided into two categories: 1) end-to-end regular
text recognition and 2) end-to-end arbitrary-shaped text recognition. Common algorithms of the first category include
FOTS and TextSpotter, which are mainly used for regular text detection and recognition. Common algorithms of
the second category include Mask TextSpotter V1, Mask TextSpotter V2, Mask TextSpotter V3, TextDragon, CharNet,
TUTS, ABCNet, ABCNet v2, PGNet, PAN++ and so on, which are mainly used for arbitrary-shaped text detection and
recognition.


CHAPTER
TEN 


PRE-PROCESSING ALGORITHM 


10.1 Background 


In OCR text detection and recognition, the quality of images directly affects the performance of detection and recognition.
Low quality images often have problems such as tilt, fold, blur, virtual scene and so on. Therefore, image preprocessing is an
important part of OCR. This section focuses on the common algorithms of data augmentation, image binarization and
denoising in image preprocessing. Data augmentation is a commonly used technique in deep learning, which increases
the number and diversity of samples by transforming training data to strengthen the generalization ability of the model.
Image binarization transforms the image into a black-and-white one, which separates the text from the background and is
conducive to text detection and recognition. Denoising removes noise interference (such as salt-and-pepper noise,
Gaussian noise, etc.) from the image. It is necessary to recognize the type of noise first, and then choose a suitable
denoising method according to the noise characteristics. Binarization and denoising are preprocessing algorithms commonly
used in traditional OCR, and they perform well on printed and scanned documents. They can also be applied first to
obtain clearer images, which are then passed to detection and recognition methods based on deep learning.


In this section, some common data augmentation, binarization, and denoising methods are screened, as shown in the
following three tables.


Table 1 Data augmentation methods


Category | Data Augmentation Methods
Standard data augmentation | Rotation, perspective transformation, blurring, Gaussian noise, random cropping and so on
Image transformation | AutoAugment, RandAugment, TimmAutoAugment
Image cropping | CutOut, RandErasing, HideAndSeek, GridMask
Image mixture | Mixup, Cutmix


Table 2 Binarization methods


Category | Binarization Methods
Global thresholding | Fixed thresholding, Otsu
Local thresholding | Adaptive thresholding, NiBlack, Sauvola, Bernsen
Methods based on deep learning | U-Net, Grid LSTM, Full Convolutional Neural Networks etc.


Table 3 Denoising methods


Category | Denoising Methods
Spatial domain filtering | Mean filtering, Gaussian filtering, median filtering, bilateral filtering, non-local means algorithm (NLM)
Transform domain filtering | Fourier transform, wavelet transform
BM3D | BM3D
Deep learning | DnCNNs, FFDNet, MPRNet


Note: If there is any omission in the table, and if you have any questions, please contact us at link. 


10.2 Data Augmentation 


Handwritten and scene text detection and recognition face many problems, such as varied text forms, complex
backgrounds, blurred text and so on. Therefore, training a robust recognition model requires a large amount of data to cover as
many scenes as possible. Compared to data collection and annotation, data augmentation is a less costly way to improve
the robustness of the model. This section focuses on some standard data augmentation methods, such as colour space
transformation, blur, and noise. There are also many improved data augmentation strategies, which insert new
operations at different stages of the data pipeline. We roughly divide these operations into four categories:


• Common data augmentation methods: rotation, perspective transformation, blur, Gaussian noise, RandCrop, etc.;


• Image transformation: transform images after RandCrop. It mainly contains AutoAugment and RandAugment;


• Image cropping techniques: crop images after transposition. It mainly contains CutOut, RandErasing, HideAndSeek
and GridMask;


• Image mixing techniques: mix the data after batch processing. It mainly contains Mixup and Cutmix.


First, import the modules and packages needed in the experiment. 


import numpy as np
import cv2
import random
# When using matplotlib.pyplot in a notebook for drawing, you need to add this
# command for display
%matplotlib inline
import matplotlib.pyplot as plt


We read an image as a sample for the experiment, output the shape of the image, and display the image. 


img = cv2.imread('test.jpg')
print('image shape: {}'.format(img.shape))
img_rgb = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
plt.imshow(img_rgb)
plt.show()


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image shape:(90, 314, 3) 
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Fig. 1 Example of test picture 
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Most images generated by data augmentation methods are somewhat random. In order to facilitate the observation of the 
effect of data augmentation, a drawing function show_img is defined below. This function displays the input image img
and the image new_img after data augmentation.


def show_img(img, new_img):
    # OpenCV loads images as BGR, while matplotlib expects RGB
    img_rgb = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
    new_img_rgb = cv2.cvtColor(new_img, cv2.COLOR_BGR2RGB)

    plt.subplot(1, 2, 1)
    plt.imshow(img_rgb)
    plt.title('origin')

    plt.subplot(1, 2, 2)
    plt.imshow(new_img_rgb)
    plt.title('transform')

    plt.show()


10.2.1 Standard Data Augmentation 


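(1) Colour space transformation (cvtColor): it transforms an image from one colour space to another. In the implementation
below, the image is converted to HSV, its V (brightness) channel is scaled by a small random factor, and the result is
converted back to BGR.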


def flag():
    """
    flag
    """
    # Random sign: +1 or -1 with (almost) equal probability
    return 1 if random.random() > 0.5000001 else -1


def cvtColor(img):
    """
    cvtColor
    """
    hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)
    delta = 0.001 * random.random() * flag()
    # Scale the V (brightness) channel by a factor close to 1
    hsv[:, :, 2] = hsv[:, :, 2] * (1 + delta)
    new_img = cv2.cvtColor(hsv, cv2.COLOR_HSV2BGR)
    return new_img


cvtcolor_img = cvtColor(img)
show_img(img, cvtcolor_img)


Fig.2 Color space conversion contrast diagram (left: origin, right: transform)


(2) Blur: The blurring effect is achieved by reducing the differences between the pixel values of neighbouring points in
the image. Gaussian blur is available through cv2.GaussianBlur, whose parameters are, in order, the image
array, the width and height of the kernel, and the standard deviation of the Gaussian kernel in the x-direction. The larger
the kernel width and height are, the blurrier the image will be.


def blur(img):
    """
    blur
    """
    h, w, _ = img.shape
    # Only blur images that are large enough for a 5 x 5 kernel to make sense
    if h > 10 and w > 10:
        return cv2.GaussianBlur(img, (5, 5), 1)
    else:
        return img


blur_img = blur(img)
show_img(img, blur_img)


Fig.3 Blur process (left: origin, right: transform)


(3) Jitter: The effect of jittering is achieved by randomly changing pixel values of the image. In the implementation below,
the image content is shifted diagonally by a small random number of pixels (at most about 1% of the shorter side).

def jitter(img):
    """
    jitter
    """
    h, w, _ = img.shape
    if h > 10 and w > 10:
        thres = min(w, h)
        s = int(random.random() * thres * 0.01)
        src_img = img.copy()
        for i in range(s):
            # Copy the image onto itself with an offset of i pixels in both directions
            img[i:, i:, :] = src_img[:h - i, :w - i, :]
        return img
    else:
        return img


# jitter modifies its input in place, so a copy is passed to keep the original for comparison
jitter_img = jitter(img.copy())
show_img(img, jitter_img)


Fig.4 Jitter process (left: origin, right: transform)


(4) Noise: Noise is an unpredictable factor in real images, so adding simulated noise to real data is a simple and effective
data augmentation method. Common choices include Gaussian noise, salt-and-pepper noise, etc. Here, Gaussian noise
is taken as an example to show the process of adding noise to an image. mean represents the mean value and var
represents the variance. The larger the variance is, the stronger the noise will be.


def add_gasuss_noise(image, mean=0, var=0.1):
    """
    Gasuss noise
    """
    # np.random.normal takes the standard deviation, hence var ** 0.5
    noise = np.random.normal(mean, var**0.5, image.shape)
    out = image + 0.5 * noise
    out = np.clip(out, 0, 255)
    out = np.uint8(out)
    return out


noise_img = add_gasuss_noise(img)
show_img(img, noise_img)


Fig.5 Noise process (left: origin, right: transform)
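
Salt-and-pepper noise, mentioned above as the other common choice, can be simulated in a similar way. The following is a minimal sketch rather than part of the original experiment: the function name add_salt_pepper_noise, the parameter amount (fraction of corrupted pixels) and the synthetic grey test image are our own assumptions.

```python
import numpy as np


def add_salt_pepper_noise(image, amount=0.05, rng=None):
    """Set a random fraction `amount` of pixels to 0 (pepper) or 255 (salt)."""
    rng = np.random.default_rng() if rng is None else rng
    out = image.copy()
    h, w = image.shape[:2]
    # One random number per pixel decides whether it is corrupted and how
    r = rng.random((h, w))
    out[r < amount / 2] = 0                        # pepper
    out[(r >= amount / 2) & (r < amount)] = 255    # salt
    return out


# Synthetic mid-grey image with the same shape as the test picture (assumption)
demo = np.full((90, 314, 3), 128, dtype=np.uint8)
noisy = add_salt_pepper_noise(demo, amount=0.1, rng=np.random.default_rng(0))
```

With a real image, the result can be inspected with show_img(img, add_salt_pepper_noise(img)).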


(5) Random Crop: This method randomly selects a region from the image and crops it out to get a new sample. Considering
that the height of the image is small in text recognition, we set top_min and top_max to limit the cropping size.

def get_crop(image):
    """
    random crop
    """
    h, w, _ = image.shape
    top_min = 1
    top_max = 8
    top_crop = int(random.randint(top_min, top_max))
    top_crop = min(top_crop, h - 1)
    crop_img = image.copy()
    ratio = random.randint(0, 1)
    if ratio:
        # Remove top_crop rows from the top
        crop_img = crop_img[top_crop:h, :, :]
    else:
        # Remove top_crop rows from the bottom
        crop_img = crop_img[0:h - top_crop, :, :]
    return crop_img


crop_img = get_crop(img)
show_img(img, crop_img)

Fig.6 Crop process


(6) Perspective transformation: It refers to the projection of an image to a new plane with a projection matrix. 


from warp_mls import WarpMLS


def tia_perspective(src):
    img_h, img_w = src.shape[:2]

    thresh = img_h // 2

    src_pts = list()
    dst_pts = list()

    # The four corners of the source image
    src_pts.append([0, 0])
    src_pts.append([img_w, 0])
    src_pts.append([img_w, img_h])
    src_pts.append([0, img_h])

    # Each corner is moved vertically by a random offset
    dst_pts.append([0, np.random.randint(thresh)])
    dst_pts.append([img_w, np.random.randint(thresh)])
    dst_pts.append([img_w, img_h - np.random.randint(thresh)])
    dst_pts.append([0, img_h - np.random.randint(thresh)])

    trans = WarpMLS(src, src_pts, dst_pts, img_w, img_h)
    dst = trans.generate()

    return dst


perspective_img = tia_perspective(img)
show_img(img, perspective_img)


(7) Colour Reverse: The method inverts the image colours by subtracting the original pixel value from the maximum value 
of grayscale level in the image. After the inversion, bright regions become darker and dark regions become brighter. 


def reverse(img):
    # Subtract every pixel value from the maximum grey level
    new_img = 255 - img
    return new_img


reverse_img = reverse(img)
show_img(img, reverse_img)


Fig.8 Color Reverse Process (left: origin, right: transform)


(8) TIA [1] is another effective data augmentation method, which first initializes a set of datum points in the image,
and then randomly shifts these points through geometric transformation to generate a new image.


Fig.9 TIA data augmentation (the control points are moved following a certain distribution to produce the augmented image)
def tia_distort(src, segment=4):
    img_h, img_w = src.shape[:2]

    cut = img_w // segment
    thresh = cut // 3

    src_pts = list()
    dst_pts = list()

    # The four corners of the source image
    src_pts.append([0, 0])
    src_pts.append([img_w, 0])
    src_pts.append([img_w, img_h])
    src_pts.append([0, img_h])

    # Each corner is moved inwards by a random offset
    dst_pts.append([np.random.randint(thresh), np.random.randint(thresh)])
    dst_pts.append(
        [img_w - np.random.randint(thresh), np.random.randint(thresh)])
    dst_pts.append(
        [img_w - np.random.randint(thresh), img_h - np.random.randint(thresh)])
    dst_pts.append(
        [np.random.randint(thresh), img_h - np.random.randint(thresh)])

    half_thresh = thresh * 0.5

    # Control points on the top and bottom edges at every segment boundary
    for cut_idx in np.arange(1, segment, 1):
        src_pts.append([cut * cut_idx, 0])
        src_pts.append([cut * cut_idx, img_h])
        dst_pts.append([
            cut * cut_idx + np.random.randint(thresh) - half_thresh,
            np.random.randint(thresh) - half_thresh
        ])
        dst_pts.append([
            cut * cut_idx + np.random.randint(thresh) - half_thresh,
            img_h + np.random.randint(thresh) - half_thresh
        ])

    trans = WarpMLS(src, src_pts, dst_pts, img_w, img_h)
    dst = trans.generate()

    return dst


distort_img = tia_distort(img)
show_img(img, distort_img)
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Fig. 10 TIA distort 
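def tia_stretch(src, segment=4):
    img_h, img_w = src.shape[:2]

    cut = img_w // segment
    thresh = cut * 4 // 5

    src_pts = list()
    dst_pts = list()

    # The four corners stay fixed
    src_pts.append([0, 0])
    src_pts.append([img_w, 0])
    src_pts.append([img_w, img_h])
    src_pts.append([0, img_h])

    dst_pts.append([0, 0])
    dst_pts.append([img_w, 0])
    dst_pts.append([img_w, img_h])
    dst_pts.append([0, img_h])

    half_thresh = thresh * 0.5

    # Every segment boundary is moved horizontally by the same random offset on both edges
    for cut_idx in np.arange(1, segment, 1):
        move = np.random.randint(thresh) - half_thresh
        src_pts.append([cut * cut_idx, 0])
        src_pts.append([cut * cut_idx, img_h])
        dst_pts.append([cut * cut_idx + move, 0])
        dst_pts.append([cut * cut_idx + move, img_h])

    trans = WarpMLS(src, src_pts, dst_pts, img_w, img_h)
    dst = trans.generate()

    return dst


stretch_img = tia_stretch(img)
show_img(img, stretch_img)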
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Fig.11 TIA stretch 
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10.2.2 Image Transformation Techniques 


Transformation means performing some transformations on the image after RandCrop. It mainly contains:
• AutoAugment
• RandAugment
• TimmAutoAugment


Unlike conventional hand-designed image augmentation methods, AutoAugment [2] is an image augmentation scheme
tailored to a specific data set, found by a search algorithm in a search space made up of a series of image augmentation
sub-strategies. For the ImageNet dataset, the final data augmentation scheme contains 25 sub-strategy combinations.
Each sub-strategy contains two transformations. For each image, a sub-strategy combination is randomly selected, and
each transformation in the sub-strategy is then performed with a certain probability. The ten images in Figure 12
are used as test images to observe the data transformation.


Fig. 12 Original test image 


The images after AutoAugment are as follows. 


Fig. 13 Diagram of AutoAugment data augmentation 


The search method of AutoAugment [3] is relatively brute-force: searching for the optimal strategy directly
on the target data set requires a lot of computation. In RandAugment, the authors found that, on the one hand, for larger
models and larger datasets, the gains from the augmentation policy searched by AutoAugment are smaller. On the
other hand, the searched strategy is tied to a certain dataset, so it generalizes poorly and is not suitable
for other datasets.


In RandAugment, the author proposes a random augmentation method. Instead of using a specific probability to determine 
whether to use a certain sub-strategy, all sub-strategies are selected with the same probability. The experiments in the 
paper also show that this method performs well even for large models. 


The images after RandAugment are as follows. 


Fig. 14 Schematic of RandAugment data augmentation


TimmAutoAugment is an improvement on AutoAugment and RandAugment made by open-source authors. It has proved
to perform better on many vision tasks, and at present most VisionTransformer models are implemented based
on TimmAutoAugment.


10.2.3 Image Cropping Techniques 
Cropping means performing some transformations on the image after Transpose, setting the pixels of the cropped area to
a certain constant. It mainly contains:

• CutOut

• RandErasing

• HideAndSeek

• GridMask


Image cropping methods can be applied before or after normalization. The difference is that if we crop the image
before normalization and fill the cropped areas with 0, the pixel values of the cropped areas will no longer be 0 after
normalization, which changes the grayscale distribution of the data. The cropping transformations listed above share the
same idea: they all aim to solve the problem of poor generalization of the trained model on occluded images. They differ
only in their cropping details.
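
The effect of the operation order can be checked with a small numerical example. The normalization constants (mean 0.5, standard deviation 0.5) and the three pixel values are assumptions chosen only for illustration.

```python
import numpy as np

mean, std = 0.5, 0.5                       # assumed normalization constants
pixels = np.array([0.2, 0.8, 0.6])         # three pixels scaled to [0, 1]
erase = np.array([True, False, False])     # the first pixel lies in the cropped area

# Case 1: fill the cropped area with 0 first, then normalize
erased_then_norm = (np.where(erase, 0.0, pixels) - mean) / std

# Case 2: normalize first, then fill the cropped area with 0
norm_then_erased = np.where(erase, 0.0, (pixels - mean) / std)
```

In the first case the erased pixel ends up at (0 - 0.5) / 0.5 = -1 instead of 0, which is the shift in the grayscale distribution described above.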


Cutout [4] can be seen as a kind of dropout, but it occludes the input image rather than the feature map. Compared with
dropout, it is more robust to noise. Cutout has two advantages:


(1) Using Cutout, we can simulate the situation in which the subject is partially occluded.


(2) It encourages the model to make full use of more of the image content for classification, and prevents the network
from focusing only on the salient area, which would cause overfitting.


The images after Cutout are as follows.
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Fig. 15 Schematic of CutOut data augmentation 
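
A minimal Cutout sketch is given below. The function name cutout, the default patch size of 16 pixels and the fill value 0 are assumptions made for illustration, not the settings used to produce the figure.

```python
import numpy as np


def cutout(img, size=16, fill=0, rng=None):
    """Mask one square patch of side `size` centred at a random location."""
    rng = np.random.default_rng() if rng is None else rng
    h, w = img.shape[:2]
    cy, cx = int(rng.integers(0, h)), int(rng.integers(0, w))
    # The patch may extend past the border, so it is clipped to the image
    y1, y2 = max(cy - size // 2, 0), min(cy + size // 2, h)
    x1, x2 = max(cx - size // 2, 0), min(cx + size // 2, w)
    out = img.copy()
    out[y1:y2, x1:x2] = fill
    return out
```

RandomErasing, described next, differs mainly in that the patch is applied only with a certain probability and its size and aspect ratio are sampled at random.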


RandomErasing [5] is similar to Cutout. It also aims to solve the problem of poor generalization of the trained
model on images with occlusion. The authors point out in the paper that this way of random cropping is complementary
to random horizontal flipping, and they also verify the effectiveness of the method on pedestrian re-identification
(REID). Unlike Cutout, RandomErasing is applied to the image with a certain probability, and the size and aspect ratio
of the generated mask are also randomly generated according to pre-defined hyperparameters. The images after
RandomErasing are as follows.


Fig. 16 Schematic of RandomErasing data augmentation 


HideAndSeek [6] divides the image into patches and generates a mask for each patch with a certain probability.
The meaning of the masks in different areas is shown in the figure below.
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Fig. 17 Diagram of HideAndSeek data augmentation 
GridMask [7] generates a mask with the same resolution as the original image and multiplies it with the original image.
The grid layout and size of the mask are controlled by hyperparameters. In the training process, there are two ways to use it:
• Set a probability p and use GridMask to augment the image with probability p from the beginning of training.


• Initially set the augmentation probability to 0 and increase it with the number of iterations from 0 to p.


Experiments show that the second method is better. The images after GridMask are as follows.


Fig. 18 Schematic of GridMask data augmentation 
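
The idea can be sketched as follows. This is a simplified illustration under our own assumptions: the parameters d (grid period in pixels) and ratio (masked fraction of each period) are hypothetical names, the grid is not rotated, and the linear schedule in grid_mask_prob is only one possible way to raise the probability from 0 to p.

```python
import numpy as np


def grid_mask(img, d=32, ratio=0.5, rng=None):
    """Zero out a regular grid of squares; `d` is the grid period in pixels."""
    rng = np.random.default_rng() if rng is None else rng
    h, w = img.shape[:2]
    block = int(d * ratio)                                     # side of each masked square
    oy, ox = int(rng.integers(0, d)), int(rng.integers(0, d))  # random grid offset
    ys = (np.arange(h) + oy) % d < block
    xs = (np.arange(w) + ox) % d < block
    mask = ~np.outer(ys, xs)                                   # False inside the masked squares
    if img.ndim == 3:
        mask = mask[..., None]
    # The mask has the same resolution as the image and is multiplied with it
    return img * mask.astype(img.dtype)


def grid_mask_prob(iteration, total_iters, p=0.7):
    """Second schedule from the text: the probability grows linearly from 0 to p."""
    return p * min(iteration / total_iters, 1.0)
```

During training, grid_mask would be applied to a sample only when a uniform random number falls below grid_mask_prob(iteration, total_iters).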
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10.2.4 Image Mixing Techniques 


Mix means performing some transformations on the images after they have been grouped into a batch. It contains:
• Mixup
• Cutmix


Data augmentation methods introduced before are based on single image while mixing is carried on a certain batch to 
generate a new batch. 


Mixup [8] is the first image mixing solution. It is easy to implement and performs well not only on image classification but
also on object detection. For simplicity, Mixup is usually carried out within a batch, and so is Cutmix. The images after
Mixup are as follows.


Fig. 19 Schematic of Mixup data augmentation 
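
A batch-level Mixup can be sketched as follows, assuming float images of shape (N, H, W, C) and one-hot labels of shape (N, K). The function name mixup_batch and the default alpha = 0.2 of the Beta distribution are assumptions for illustration.

```python
import numpy as np


def mixup_batch(images, labels, alpha=0.2, rng=None):
    """Blend each sample with a randomly chosen partner from the same batch."""
    rng = np.random.default_rng() if rng is None else rng
    lam = rng.beta(alpha, alpha)            # mixing coefficient in [0, 1]
    perm = rng.permutation(len(images))     # partner index for every sample
    mixed_images = lam * images + (1.0 - lam) * images[perm]
    mixed_labels = lam * labels + (1.0 - lam) * labels[perm]
    return mixed_images, mixed_labels, lam
```

The labels are blended with the same coefficient as the pixels, so each mixed label still sums to 1.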


Unlike Mixup, which directly adds two images, Cutmix [9] randomly cuts out an ROI from one image and then pastes it
onto the corresponding area of another image. The images after Cutmix are as follows.


Fig. 20 Schematic of Cutmix data augmentation 
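
A corresponding Cutmix sketch is shown below, under the same assumptions as before (float images of shape (N, H, W, C), one-hot labels). The name cutmix_batch and the default alpha = 1.0 are hypothetical; weighting the labels by the pasted area is one common choice.

```python
import numpy as np


def cutmix_batch(images, labels, alpha=1.0, rng=None):
    """Paste one random box from a partner image into every image of the batch."""
    rng = np.random.default_rng() if rng is None else rng
    n, h, w = images.shape[:3]
    lam = rng.beta(alpha, alpha)
    perm = rng.permutation(n)
    # Box covering roughly (1 - lam) of the image area
    cut_h, cut_w = int(h * np.sqrt(1.0 - lam)), int(w * np.sqrt(1.0 - lam))
    cy, cx = int(rng.integers(0, h)), int(rng.integers(0, w))
    y1, y2 = max(cy - cut_h // 2, 0), min(cy + cut_h // 2, h)
    x1, x2 = max(cx - cut_w // 2, 0), min(cx + cut_w // 2, w)
    mixed = images.copy()
    mixed[:, y1:y2, x1:x2] = images[perm][:, y1:y2, x1:x2]
    # Recompute lam from the area actually pasted (the box may be clipped)
    lam = 1.0 - (y2 - y1) * (x2 - x1) / (h * w)
    mixed_labels = lam * labels + (1.0 - lam) * labels[perm]
    return mixed, mixed_labels, lam
```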




10.3 Image Binarization 


Image binarization refers to setting the greyscale values of pixels to 0 or 255, so that the whole image presents an obvious
black-and-white effect. Each pixel in a binary image takes one of only two values: 0 represents black and 255 represents white.
Image binarization can reduce the interference caused by noise, eliminate background interference and thus highlight the
contours of the object. The OCR text recognition performance can be improved if the binarization can distinguish between
the foreground and the background. We summarize the commonly used image binarization methods, which are mainly
classified into global binarization, local binarization, and deep learning methods.


10.3.1 Global Thresholding 


Global thresholding refers to processing all pixels in an image with the same threshold value. The methods
include fixed thresholding, Otsu, etc.


The fixed threshold method uses a fixed value for all pixels as the global threshold T. If the value of the current
pixel is greater than or equal to the threshold T, the pixel is assigned a value of 255; otherwise it is assigned
a value of 0. It is usually necessary to try different thresholds and observe the binarization effect experimentally,
and it is difficult to determine the optimal threshold for different images. The binarization effects of different
thresholds are shown in Figure 21. To overcome this problem, the Japanese scholar Nobuyuki Otsu proposed an
adaptive thresholding approach, Otsu [10], in 1979. Otsu divides the pixels of the image into foreground and background.
The larger the variance between the two classes is, the more distinct the foreground and the background are from
each other. Therefore, the between-class variance is calculated for each candidate gray value, and the gray value with
the maximum between-class variance is taken as the binarization threshold.
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Fig.21 Effects of binarization with different thresholds 
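
The search for the Otsu threshold can be sketched with NumPy as follows. The function names otsu_threshold and binarize and the synthetic two-level test image are our own assumptions; the input is assumed to be an 8-bit greyscale array.

```python
import numpy as np


def otsu_threshold(gray):
    """Return the gray level T that maximizes the between-class variance."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(np.float64)
    prob = hist / hist.sum()
    levels = np.arange(256)
    # Class 0 holds the gray levels < T, class 1 holds the gray levels >= T
    w0 = np.cumsum(prob) - prob
    w1 = 1.0 - w0
    sum0 = np.cumsum(prob * levels) - prob * levels
    total = (prob * levels).sum()
    with np.errstate(divide='ignore', invalid='ignore'):
        mu0 = sum0 / w0                      # mean of class 0
        mu1 = (total - sum0) / w1            # mean of class 1
        between = w0 * w1 * (mu0 - mu1) ** 2
    # Thresholds that leave one class empty produce NaN and are ignored
    return int(np.argmax(np.nan_to_num(between)))


def binarize(gray, t):
    # Same rule as the fixed threshold method: >= T becomes 255, otherwise 0
    return np.where(gray >= t, 255, 0).astype(np.uint8)


# Synthetic image: dark left half, bright right half (assumption)
demo = np.full((20, 20), 50, dtype=np.uint8)
demo[:, 10:] = 200
t = otsu_threshold(demo)
binary = binarize(demo, t)
```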


10.3.2 Local Thresholding 


If the image has some problems such as uneven illumination, the binarization methods based on global threshold will not 
achieve the expected binarization effect, so in this case the binarization method based on local threshold is more suitable. 
Common local thresholding methods for image binarization include adaptive thresholding algorithm [11], NiBlack [12], 
Sauvola [13] and so on. 


In the adaptive thresholding algorithm, a sliding window of size s × s centred on a pixel sweeps
across the whole image. At each slide, the pixels in the current window are averaged and the average value is used
as the local threshold T. If the value of a pixel in the current window is less than the local threshold T/100, the pixel is
assigned a value of 0; if it is greater than the local threshold T/100, it is assigned a value of 255, as shown in Figure 22.
NiBlack calculates the mean m and the standard deviation s of the pixels in a local region of the image, and then calculates
the local threshold through the formula $T = m + k \cdot s$, where k is a correction factor with a value between 0 and 1.
Finally, binarization is performed according to the threshold T. Sauvola is an improvement on the NiBlack algorithm. It
calculates the local threshold through $T = m \cdot [1 + k \cdot (\frac{s}{R} - 1)]$, where R is the dynamic range of the
standard deviation; if the current input image is an 8-bit grey-scale image, R = 128. Sauvola performs better than NiBlack
in situations such as uneven illumination.
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Fig.22 The effect of adaptive binarization 
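
The NiBlack and Sauvola thresholds can be sketched as follows. The window size of 15 pixels, k = 0.2 and the reflect padding at the image border are assumptions for illustration; the local statistics are computed naively, which is adequate only for small images.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def local_mean_std(gray, window=15):
    """Mean m and standard deviation s over a window x window neighbourhood."""
    pad = window // 2
    padded = np.pad(gray.astype(np.float64), pad, mode='reflect')
    patches = sliding_window_view(padded, (window, window))
    return patches.mean(axis=(-2, -1)), patches.std(axis=(-2, -1))


def niblack(gray, window=15, k=0.2):
    m, s = local_mean_std(gray, window)
    t = m + k * s                               # T = m + k * s
    return np.where(gray >= t, 255, 0).astype(np.uint8)


def sauvola(gray, window=15, k=0.2, r=128.0):
    m, s = local_mean_std(gray, window)
    t = m * (1.0 + k * (s / r - 1.0))           # T = m * [1 + k * (s / R - 1)]
    return np.where(gray >= t, 255, 0).astype(np.uint8)
```

On a flat background s is close to 0, so the Sauvola threshold drops to about (1 - k) * m and the background stays white even when the illumination varies across the image.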


10.3.3 Techniques Based on Deep Learning 


It is difficult for traditional global and local binarization methods to set appropriate thresholds, which results in poor 
image binarization effect. With the continuous development of deep learning, some scholars have begun to try to use 
neural networks to binarize images. 


Pratikakis I et al. [14] report that the U-Net convolutional neural network architecture was used for document image
binarization in the ICDAR2017 DIBCO competition and won the championship. Chris Tensmeyer et al. [15] use a multi-scale
full convolutional neural network to binarize document images and the results are shown in Figure 23. Vo QN et al. [16]
propose a hierarchical deep supervised network (DSN) for document binarization and the results are shown in Figure 24.
Westphal F et al. [17] use a Grid Long Short Term Memory (Grid LSTM) network for binarization. However, its performance
was lower than that of the method in [16]. Calvo-Zaragoza J et al. [18] use a deep encoder-decoder architecture to achieve
binarization.


(a) original picture (b) FCN binarization


Fig.23 The binarization effect of the full convolutional neural network 
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Fig.24 The binarization effect of hierarchical DSN 


10.4 Denoising 


Image noise refers to unnecessary or redundant interference information in image data. It may seriously affect
the data quality, so it is often necessary to remove it. In addition, the details of the image need to be preserved while
removing noise. The denoising methods are categorized into four groups: spatial domain filtering, transform domain
filtering, non-local filtering, and methods based on neural networks.


10.4.1 Spatial Domain Filtering 


Spatial domain filtering refers to the processing of pixel values by performing data operations directly on the original 
image. Common spatial domain filtering denoising algorithms include mean filter, Gaussian filter, median filter, bilateral 
filter [19], Non-Local Means (NLM) algorithm [20], etc. 


The mean filter replaces the original value of pixel A with the average pixel value in the neighbourhood of A.
It is a typical smoothing linear filter. The mean filter is simple to calculate and smooths the whole
image. However, it cannot preserve image details, which makes the image blurred. The Gaussian filter is also a linear filter
and a commonly used filtering algorithm. After Gaussian filtering, the value of each image pixel is replaced by a weighted
average of itself and the other pixel values in its neighbourhood. Compared to the mean filter, the Gaussian filter performs
better on smoothing, preserves edge information better, and suppresses Gaussian noise.


The median filter replaces the value of the target pixel with the median of the grey values in its neighbourhood. It is a
non-linear filter, suitable for dealing with salt-and-pepper noise while preserving image edge details. The bilateral filter
takes into account not only the spatial distance between pixels, but also their similarity in colour intensity.


Four filters are implemented with the OpenCV library.


noise_img = cv2.imread('preprocess/noise.png')
# Mean filter
img_mean = cv2.blur(noise_img, (5,5))
show_img(noise_img, img_mean)

# Gaussian filter
img_gussian = cv2.GaussianBlur(noise_img, (5,5), 1)
show_img(noise_img, img_gussian)

# Median filter
img_median = cv2.medianBlur(noise_img, 5)
show_img(noise_img, img_median)

# Bilateral filter
img_bilater = cv2.bilateralFilter(noise_img, 3, 15, 15)
show_img(noise_img, img_bilater)
Filters based on neighbourhood pixels basically only consider the gray values of the pixels within the
sliding window, without considering the statistical information of pixels within the window (such as variance), the pixel
distribution characteristics of the whole image, or prior knowledge of the noise. In view of these limitations, the NLM
algorithm was proposed, which uses redundant information in the image to remove noise while retaining as much detail
as possible. The NLM method uses the whole image for denoising: it searches the image for similar regions block by block
and computes the weighted average of these regions to obtain the denoised image. The similarity is measured with a
weighted Euclidean distance, and the more similar a region is, the larger its weight.
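
The following naive sketch illustrates the NLM idea on a small greyscale image. It is a simplification under our own assumptions: the search is restricted to a search × search window instead of the whole image, the patch distance is a plain (unweighted) mean squared difference, and the names nlm_denoise, patch, search and h (filter strength) are hypothetical.

```python
import numpy as np


def nlm_denoise(gray, patch=3, search=7, h=20.0):
    """Naive non-local means for a small greyscale image (slow, for illustration)."""
    img = gray.astype(np.float64)
    pr, sr = patch // 2, search // 2
    off = pr + sr
    padded = np.pad(img, off, mode='reflect')
    rows, cols = img.shape
    acc = np.zeros_like(img)
    weight_sum = np.zeros_like(img)
    for dy in range(-sr, sr + 1):
        for dx in range(-sr, sr + 1):
            # Squared distance between the patch around each pixel and the
            # patch around its neighbour shifted by (dy, dx)
            dist = np.zeros_like(img)
            for py in range(-pr, pr + 1):
                for px in range(-pr, pr + 1):
                    a = padded[off + py:off + py + rows, off + px:off + px + cols]
                    b = padded[off + py + dy:off + py + dy + rows,
                               off + px + dx:off + px + dx + cols]
                    dist += (a - b) ** 2
            # Similar patches (small distance) receive a large weight
            w = np.exp(-dist / (patch * patch * h * h))
            acc += w * padded[off + dy:off + dy + rows, off + dx:off + dx + cols]
            weight_sum += w
    return np.clip(acc / weight_sum, 0, 255).astype(np.uint8)
```

The strength h should be of the same order as the noise standard deviation; with a much smaller h almost all the weight stays on the pixel itself and little noise is removed.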


10.5 Transform Domain Filtering 


Transform domain filtering methods convert the image from the original spatial domain to a transform domain. In the
transform domain, the noise can be divided into high-, medium- and low-frequency noise, and noise of different
frequencies can be separated. The image is then converted from the transform domain back to the
original spatial domain by the inverse transformation, so as to remove the image noise. There are many transforms to
choose from, including the Fourier transform, the wavelet transform and so on.


The Fourier transform converts the input image from the spatial domain to the frequency domain, which contains both low-
and high-frequency information. The points in the image where the grey values change rapidly often correspond to
high-frequency noise. A low-pass filter applied in the Fourier domain removes the high-frequency components of the image
and only lets the low-frequency information pass, so as to remove the
image noise.
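
A minimal sketch of Fourier low-pass denoising with NumPy is given below. The ideal (hard-edged) circular mask and the cut-off radius are assumptions chosen for simplicity; in practice a smoother filter such as a Gaussian is often preferred to avoid ringing.

```python
import numpy as np


def fft_lowpass(gray, radius=30):
    """Keep only the frequencies within `radius` of the spectrum centre."""
    h, w = gray.shape
    # Forward transform, with the zero frequency shifted to the centre
    spectrum = np.fft.fftshift(np.fft.fft2(gray.astype(np.float64)))
    yy, xx = np.ogrid[:h, :w]
    mask = (yy - h // 2) ** 2 + (xx - w // 2) ** 2 <= radius ** 2   # ideal low-pass
    # Inverse transform back to the spatial domain
    filtered = np.fft.ifft2(np.fft.ifftshift(spectrum * mask)).real
    return np.clip(filtered, 0, 255).astype(np.uint8)
```

A smaller radius removes more noise but also blurs edges more strongly, since sharp edges are high-frequency content as well.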


Wavelet transform denoising can be carried out in three steps: 


(1) Wavelet decomposition of the image: select a wavelet and the number of decomposition layers N, and
apply N layers of wavelet decomposition to the signal s.


(2) Non-linear threshold quantization of the wavelet transform coefficients: select a threshold for each layer from 1 to N;
the high-frequency coefficients of that layer are quantized with it, while the low-frequency coefficients of each layer are
not processed.


(3) Wavelet coefficient reconstruction: based on the low-frequency coefficients of the Nth layer and the processed
high-frequency coefficients of layers 1 to N, the original signal is reconstructed by the inverse wavelet transform.


The key to wavelet transform denoising lies in the threshold value. The threshold function can be divided into the hard
threshold function and the soft threshold function.
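
The two threshold functions used in step (2) can be written as follows. The function names and the example coefficients are assumptions for illustration; the decomposition and reconstruction steps themselves are left to a wavelet library and are not shown here.

```python
import numpy as np


def hard_threshold(coeffs, t):
    # Keep the coefficients whose magnitude exceeds t and set the rest to zero
    return np.where(np.abs(coeffs) > t, coeffs, 0.0)


def soft_threshold(coeffs, t):
    # Set small coefficients to zero and shrink the remaining ones towards zero by t
    return np.sign(coeffs) * np.maximum(np.abs(coeffs) - t, 0.0)


c = np.array([-3.0, -0.5, 0.2, 1.0, 4.0])   # example high-frequency coefficients
hard = hard_threshold(c, 1.0)
soft = soft_threshold(c, 1.0)
```

Hard thresholding keeps the surviving coefficients unchanged but is discontinuous at the threshold, while soft thresholding is continuous at the price of shrinking every surviving coefficient by t.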


10.5.1 BM3D 


BM3D (Block-matching and 3D filtering) [21] combines spatial domain filtering and transform domain filtering
techniques. It first uses the similar-block search of NLM, and then integrates the wavelet transform denoising
method. The algorithm finds similar blocks by similarity determination and combines them into 3D groups.
The 3D groups are then transformed into the wavelet domain, where hard thresholding or Wiener filtering is used to
reduce the noise. Finally, an inverse transformation is performed and all the image blocks are aggregated to obtain the
noise-reduced image.
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10.5.2 Methods based on Deep Learning 


With the development of deep learning, denoising algorithms based on deep learning keep emerging, and CNN-based
denoising methods have improved the denoising effect. Common methods based on deep learning include DnCNNs [22],
FFDNet [23], MPRNet [24] and so on.


DnCNNs (Denoising CNNs) introduce residual learning and batch normalisation into image denoising for the first
time; the combination of the two is mutually beneficial and effectively improves training speed and denoising
performance. FFDNet (Fast and Flexible Denoising Network) works on downsampled sub-images, which speeds up
training and inference and enlarges the receptive field. It also adopts orthogonal regularization
to enhance generalization. MPRNet first learns contextualized features using encoder-decoder
architectures with a large receptive field, and then combines them with a high-resolution branch that retains the local
information needed in the image. MPRNet can be applied to multiple tasks such as rain removal, deblurring and denoising.


10.6 Summary 


This section focuses on common methods of data augmentation, binarization and denoising in image pre-processing. First, 
in order to improve the robustness of the model, data augmentation is usually used on the training samples. Four kinds 
of data augmentation techniques are introduced here: 


1) Standard data augmentation techniques: rotation, perspective transformation, blurring, Gaussian noise, random
cropping, etc.;

2) Image transformation techniques: transform the images after RandCrop. This category mainly contains AutoAugment and
RandAugment;

3) Image cropping techniques: crop the images after transposition and set the pixel values of the cropped region to a
particular constant (the default value is 0). This category mainly contains CutOut, RandErasing, HideAndSeek and GridMask;

4) Image mixing techniques: mix the data after batch processing. This category mainly contains Mixup and CutMix.
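
As a reminder of how the cropping and mixing families operate, here is a minimal NumPy sketch of Cutout, Mixup and
CutMix. It assumes batches of shape (N, H, W) or (N, H, W, C) and one-hot label rows; the function names and the
Beta parameter in the usage lines are illustrative choices rather than settings taken from this book.

```python
import numpy as np


def cutout(image, top, left, size, fill=0):
    """Image cropping family: set a size x size square to a constant value."""
    out = image.copy()
    out[top:top + size, left:left + size] = fill
    return out


def mixup(batch, labels, lam, perm):
    """Image mixing family: blend every sample (and its label) with a partner sample."""
    mixed = lam * batch + (1.0 - lam) * batch[perm]
    mixed_labels = lam * labels + (1.0 - lam) * labels[perm]
    return mixed, mixed_labels


def cutmix(batch, labels, box, perm):
    """Paste a box from the partner sample and mix the labels by the pasted area.

    The box (top, left, height, width) is assumed to lie fully inside the image.
    """
    top, left, bh, bw = box
    mixed = batch.copy()
    mixed[:, top:top + bh, left:left + bw] = batch[perm][:, top:top + bh, left:left + bw]
    lam = 1.0 - (bh * bw) / (batch.shape[1] * batch.shape[2])
    return mixed, lam * labels + (1.0 - lam) * labels[perm]


# Usage: mixing happens on a whole batch, after per-image preprocessing.
rng = np.random.default_rng(0)
batch = rng.random((4, 32, 32))
labels = np.eye(4)
lam = rng.beta(0.2, 0.2)            # mixing weight drawn from a Beta distribution
perm = rng.permutation(len(batch))  # partner index for every sample
mixed, mixed_labels = mixup(batch, labels, lam, perm)
```
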


Secondly, high-quality binary images can effectively improve text recognition performance. Global thresholding (Fixed
Thresholding, Otsu), local thresholding (Adaptive Thresholding, Niblack, Sauvola, Bernsen) and methods based on deep
learning (U-Net, Grid LSTM, Fully Convolutional Network, etc.) are mainly introduced.
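
As a compact reminder of global thresholding, the following sketch computes an Otsu threshold with NumPy by maximizing
the between-class variance over all 256 grey levels. It assumes an 8-bit greyscale input; the function name is
illustrative.

```python
import numpy as np


def otsu_threshold(gray):
    """Return the grey level t that maximizes the between-class variance.

    gray must be a uint8 array; pixels with value > t form the bright class.
    """
    hist = np.bincount(gray.ravel(), minlength=256).astype(np.float64)
    prob = hist / hist.sum()
    omega = np.cumsum(prob)                # probability of the dark class for each t
    mu = np.cumsum(prob * np.arange(256))  # cumulative mean up to each t
    mu_total = mu[-1]
    denom = omega * (1.0 - omega)
    with np.errstate(divide="ignore", invalid="ignore"):
        sigma_b = (mu_total * omega - mu) ** 2 / denom
    sigma_b[denom == 0] = 0.0              # thresholds that leave one class empty
    return int(np.argmax(sigma_b))


# Usage: binary = gray > otsu_threshold(gray)
```

Local methods such as Niblack and Sauvola follow the same pattern but compute the threshold per pixel from the mean and
standard deviation of a surrounding window instead of from the global histogram.
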


Thirdly, in the era of big data, image quality is uneven, so image filtering is used more and more widely and faces
ever higher requirements. This section therefore also briefly introduces some image denoising methods, including
spatial domain filtering, transform domain filtering, BM3D and filtering based on deep learning.


CHAPTER 
ELEVEN 


DATA SYNTHESIS ALGORITHMS 


11.1 Background 


The performance of deep learning methods is closely related to the quality and quantity of training data, but acquiring
massive annotated data has become a bottleneck for many deep learning tasks. Images from different scenes in OCR tasks
have their own styles in fonts, blur, lighting, and so on. Collecting enough data is challenging and costly, and manual
annotation is time-consuming and error-prone. Therefore, when sufficient annotated data is not available, automatic
synthesis of annotated images is gaining more and more attention, because it alleviates both the shortage of data and
the burden of manual annotation. In OCR, for example, synthesized data is important for training text detection and
recognition models and has proved effective in numerous algorithms. This section presents some representative data
synthesis methods from recent years: SynthText, Verisimilar, SynthText3D, SF-GAN, SRNet, ScrabbleGAN and UnrealText.


11.2 Data Synthesis Algorithms 


Ankush Gupta et al. (2016) [1] propose SynthText, a method that synthesizes text images by overlaying synthetic text
onto background images to generate text images in natural scenes. The synthesis process consists of five steps: 1) collect
a large number of text-free background images, fonts and text corpora; 2) segment the image into contiguous regions
based on local colour and texture cues, and use a CNN to obtain a dense pixel-wise depth map; 3) select the candidate
regions according to semantic and depth information; 4) choose the colour of the text and, optionally, of its outline
according to the colour of the candidate region; 5) render the text with a randomly selected font. A rendered image is
shown in Figure 1.


Figure 1 The synthetic image using SynthText 
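
The last two steps (choosing a colour that suits the candidate region and rendering the text into it) can be sketched as
follows. This is a deliberately simplified stand-in, not the SynthText implementation: the colour is picked from a small
hypothetical palette by maximum contrast with the region, the glyph mask is assumed to come from any font renderer, and
the region is an axis-aligned rectangle.

```python
import numpy as np

# Hypothetical palette of candidate text colours (RGB); not taken from SynthText.
PALETTE = np.array(
    [[0, 0, 0], [255, 255, 255], [200, 30, 30], [30, 30, 200], [240, 220, 40]],
    dtype=np.float64,
)


def pick_text_colour(background, region):
    """Pick the palette colour that is farthest from the mean colour of the region."""
    top, left, h, w = region
    mean = background[top:top + h, left:left + w].reshape(-1, 3).mean(axis=0)
    distances = np.linalg.norm(PALETTE - mean, axis=1)
    return PALETTE[np.argmax(distances)]


def overlay_text(background, glyph_mask, region):
    """Alpha-blend a glyph mask (values in [0, 1]) into the region.

    Returns the synthesized image and the text box (x0, y0, x1, y1); the
    annotation is known exactly because the text was placed programmatically.
    """
    top, left, h, w = region
    if glyph_mask.shape != (h, w):
        raise ValueError("glyph_mask must have the same size as the region")
    colour = pick_text_colour(background, region)
    out = background.astype(np.float64)
    alpha = glyph_mask[..., None]
    patch = out[top:top + h, left:left + w]
    out[top:top + h, left:left + w] = alpha * colour + (1.0 - alpha) * patch
    return out.astype(np.uint8), (left, top, left + w, top + h)
```

The point of the sketch is the second return value: every synthesized image comes with an exact bounding box (and, in a
full pipeline, the text string), which is why synthetic data removes the need for manual annotation.
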


Fangneng Zhan et al. (2018) [2] propose another image synthesis technique, Verisimilar, that aims to generate
massive annotated scene text images for training accurate and robust scene text detection and recognition models. The
authors combine semantic segmentation and visual saliency in text synthesis. Semantic segmentation makes the text
appear only on semantically sensible objects: for example, scene text tends to appear on a wall or table surface rather
than on food or leaves. Visual saliency is used to determine the embedding locations of the text, so that the text can
be distinguished from the background. Finally, the method designs an adaptive scene text appearance model that
determines the color and brightness of the source text by learning from the features of real scene text images. The
method works well for training accurate and robust scene text detection and recognition models, and produces
verisimilar scene text images. The results are shown in Figure 2.


Figure 2 The results of Verisimilar 
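
The placement logic can be illustrated with a small NumPy sketch. It assumes a per-pixel semantic label map and a
saliency map are already available (from any segmentation and saliency model), restricts the text box to a set of
allowed classes, and then picks the position with the lowest total saliency. The function names, and the choice of
"lowest saliency" as the criterion, are assumptions made for this example rather than details from the paper.

```python
import numpy as np


def box_sums(arr, bh, bw):
    """Sum of arr over every bh x bw window (valid positions only), via an integral image."""
    integral = np.pad(arr, ((1, 0), (1, 0))).cumsum(axis=0).cumsum(axis=1)
    return (integral[bh:, bw:] - integral[:-bh, bw:]
            - integral[bh:, :-bw] + integral[:-bh, :-bw])


def choose_location(saliency, labels, allowed, bh, bw):
    """Return the top-left (y, x) of the best bh x bw text box, or None if none fits.

    A box is valid only if every pixel belongs to an allowed semantic class;
    among the valid boxes, the one with the lowest total saliency wins.
    """
    ok = np.isin(labels, list(allowed)).astype(np.int64)
    inside = box_sums(ok, bh, bw) == bh * bw
    cost = box_sums(saliency.astype(np.float64), bh, bw)
    cost = np.where(inside, cost, np.inf)
    if not np.isfinite(cost).any():
        return None
    y, x = np.unravel_index(np.argmin(cost), cost.shape)
    return int(y), int(x)
```

Here the semantic mask plays the role of "text only appears on sensible surfaces" and the saliency cost keeps the text
in homogeneous areas where it stays readable.
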


Minghui Liao et al. (2019) [3] propose SynthText3D, a method for synthesizing scene text images from 3D virtual worlds.
In this way, real-world variations, including complex perspective transformations, various illuminations and
occlusions, can be reproduced in the synthesized scene text images. Specifically, text instances in various fonts are
first embedded into suitable positions in a 3D virtual world. The text and the background scene are then rendered
together, so the synthetic images show realistic visual effects such as varying illumination and occlusion.
Finally, the authors place cameras at different locations and orientations to produce images of the same text from
different viewpoints. The synthesis results are shown in Figure 3.


(a) Various illuminations and visibility of the same text instances. 


(b) Different viewpoints of the same text instances. 


(c) Different occlusion cases of the same text instances. 


Figure 3 The 3D synthetic images with SynthText3D 
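
The viewpoint step can be illustrated with a plain pinhole camera model: once a text instance is a planar quad in the
3D world, moving the camera changes its projected outline, and the projected corners are the annotation. The sketch
below is not code from SynthText3D; the focal length, image centre and function names are assumptions, and a
right-handed world with the y axis pointing up is assumed.

```python
import numpy as np


def look_at(camera_pos, target, up=(0.0, 1.0, 0.0)):
    """World-to-camera rotation R and translation t for a camera looking at target.

    Camera axes are x = right, y = down, z = forward. The viewing direction
    must not be parallel to the up vector.
    """
    camera_pos = np.asarray(camera_pos, dtype=np.float64)
    forward = np.asarray(target, dtype=np.float64) - camera_pos
    forward /= np.linalg.norm(forward)
    right = np.cross(forward, up)
    right /= np.linalg.norm(right)
    down = np.cross(forward, right)
    R = np.stack([right, down, forward])  # rows are the camera axes in world coordinates
    t = -R @ camera_pos
    return R, t


def project(points, camera_pos, target, focal=500.0, centre=(320.0, 240.0)):
    """Project (N, 3) world points to (N, 2) pixel coordinates."""
    R, t = look_at(camera_pos, target)
    cam = points @ R.T + t
    return focal * cam[:, :2] / cam[:, 2:3] + np.asarray(centre)


# A 2 x 1 text quad in the plane z = 0 (top-left, top-right, bottom-right, bottom-left).
quad = np.array([[-1.0, 0.5, 0.0], [1.0, 0.5, 0.0], [1.0, -0.5, 0.0], [-1.0, -0.5, 0.0]])
frontal = project(quad, camera_pos=(0.0, 0.0, 5.0), target=(0.0, 0.0, 0.0))
oblique = project(quad, camera_pos=(4.0, 0.0, 5.0), target=(0.0, 0.0, 0.0))
```

In the frontal view the quad projects to an axis-aligned rectangle; in the oblique view the edge nearer to the camera
becomes taller than the far edge, which is the perspective distortion that 2D overlay methods cannot produce naturally.
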


Fangneng Zhan et al. (2019) [4] propose the Spatial Fusion GAN (SF-GAN), which combines a geometry synthesizer and
an appearance synthesizer to generate images that are realistic in both geometry and appearance spaces.
The geometry synthesizer learns the contextual geometry of background images and transforms and places foreground
objects into the background images consistently. The appearance synthesizer adjusts the color, brightness and style of
the foreground objects. The authors also design a fusion network that introduces guided filters to preserve details
and keep the appearance realistic. The synthetic results are shown in Figure 4.


Figure 4 Synthetic results of SF-GAN 
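
To show what "detail-preserving" means for a guided filter, the sketch below implements the classic greyscale guided
filter in NumPy. How SF-GAN wires the filter into its fusion network is not covered here; the code only illustrates the
filter itself, and the function names, radius and eps value are illustrative.

```python
import numpy as np


def box_mean(arr, r):
    """Mean over a (2r+1) x (2r+1) window; the window shrinks at the image borders."""
    h, w = arr.shape
    integral = np.pad(arr, ((1, 0), (1, 0))).cumsum(axis=0).cumsum(axis=1)
    y0 = np.clip(np.arange(h) - r, 0, h)
    y1 = np.clip(np.arange(h) + r + 1, 0, h)
    x0 = np.clip(np.arange(w) - r, 0, w)
    x1 = np.clip(np.arange(w) + r + 1, 0, w)
    total = (integral[y1][:, x1] - integral[y0][:, x1]
             - integral[y1][:, x0] + integral[y0][:, x0])
    area = (y1 - y0)[:, None] * (x1 - x0)[None, :]
    return total / area


def guided_filter(guide, src, r=2, eps=1e-3):
    """Filter src so that its edges follow the edges of guide.

    In each window the output is modelled as a * guide + b; eps controls how
    strongly low-variance (flat) regions are smoothed.
    """
    mean_i = box_mean(guide, r)
    mean_p = box_mean(src, r)
    cov_ip = box_mean(guide * src, r) - mean_i * mean_p
    var_i = box_mean(guide * guide, r) - mean_i ** 2
    a = cov_ip / (var_i + eps)
    b = mean_p - a * mean_i
    return box_mean(a, r) * guide + box_mean(b, r)
```

Because the output is locally a linear function of the guide image, sharp structures in the guide survive the
filtering, which is the property a fusion stage can use to keep fine detail while adjusting appearance.
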


Liang Wu et al. (2019) [5] propose an end-to-end trainable style retention network (SRNet) that replaces the text
content of the source image with the target text while keeping the original image style. The framework consists
of three modules: 1) the style migration module for the text foreground, 2) the background extraction module, and 3)
the fusion module. The first module replaces the text content of the source image with the target text while
preserving the style of the original text region. The second module erases the original text and fills the text region
with appropriate texture. The fusion module combines the information from the two former modules and generates the
edited text image. After these three steps, the text style of an image can be migrated quickly. The results are shown
in Figure 5.


Figure 5 The synthetic results of SRNet 


Sharon Fogel et al. (2020) [6] propose ScrabbleGAN, a semi-supervised approach to synthesizing handwritten text images.
Being semi-supervised, it can train the handwritten text synthesis framework on unlabelled data in addition to labelled
data. ScrabbleGAN relies on a new fully convolutional generator that can generate images of arbitrarily long
words or even complete sentences. In addition, the generator can control the style of the generated text: for
example, whether the text is cursive or how thin the pen stroke is. The results are shown in Figure 6.


Shangbang Long et al. (2020) [7] propose another effective image synthesis method, UnrealText, which synthesizes scene
text images from a 3D virtual world. It is built on Unreal Engine 4 (UE4), hence the name UnrealText. Specifically,
text instances are regarded as planar polygon meshes with the text foregrounds loaded as textures, and they are placed
at suitable positions in the 3D world. The text and the scene are then rendered together as a whole, achieving
realistic visual effects such as illumination, occlusion and perspective transformation. The results are shown in
Figure 7. The UnrealText engine achieves realistic rendering and high scalability, and the synthesized data
significantly improves the performance of text detection and recognition models. A large-scale multilingual scene text
dataset is also constructed, which is helpful for further research.


Figure 7 The synthetic results of UnrealText 


11.3 Summary 


Public datasets for text detection and recognition may not meet the needs of a given application scenario, or may
contain too little data. Moreover, manual annotation of data is time-consuming and labour-intensive, so OCR data
synthesis has become a common practice. In this section, we have summarized some data synthesis methods, including
SynthText, SynthText3D, SF-GAN, SRNet and so on. Training data generated by these methods improves
the performance of text detection and recognition.